Q4 2022 Course Catalog
UPDATED: JANUARY 2022
Welcome to the Databricks Academy
About the Databricks Academy
About this course catalog
What’s new this quarter
What’s being retired this quarter
Databricks Academy offerings
Training
Credentials
Learning paths
Business leaders
Platform administrators
SQL analysts
Data scientists
Data engineers
Course descriptions
Apache Spark Programming with Databricks
Applications of SQL on Databricks
Applied Data Engineering Solutions with Databricks
AWS Databricks Cloud Architecture and System Integration Fundamentals
AWS Databricks Cluster Usage Management
AWS Databricks Data Access Management
AWS Databricks Identity Access Management
AWS Databricks Security Fundamentals
AWS Databricks SQL Administration
AWS Databricks Workspace Deployment
Azure Databricks Cloud Architecture and System Integration Fundamentals
Azure Databricks Cluster Usage Management
Azure Databricks Data Access Management
Azure Databricks Identity Access Management
Azure Databricks Security Fundamentals
Azure Databricks SQL Administration
Azure Databricks Workspace Deployment
Certification Prep Course for the Databricks Certified Associate Developer for Apache Spark Exam
Configuring Workspace Access Control Lists (ACLs)
Data Engineering with Databricks
Data Science on Databricks Rapid Start
Data Science on Databricks - Bias-Variance Tradeoff
Data Visualization with Databricks SQL
Databricks Command Line Interface (CLI) Fundamentals
Databricks Datadog Integration
Databricks on Google Cloud: Architecture and Security Fundamentals
Databricks on Google Cloud: Cloud Architecture and System Integration
Databricks on Google Cloud: Cluster Usage Management
Databricks on Google Cloud: Workspace Deployment
Databricks with R
Databricks Workspace Fundamentals for Business Analytics
Delta Lake Rapid Start with Python
Delta Lake Rapid Start with Spark SQL
Deploying a Machine Learning Project with MLflow Projects
Easy ETL with Auto Loader
Enterprise Architecture with Databricks
Fundamentals of Big Data
Fundamentals of Cloud Computing
Fundamentals of Databricks Machine Learning
Fundamentals of Databricks SQL
Fundamentals of Delta Lake
Fundamentals of Enterprise Data Management Systems
Fundamentals of Lakehouse Architecture
Fundamentals of Machine Learning
Fundamentals of Structured Streaming
Fundamentals of the Databricks Lakehouse Platform
Google Cloud Fundamentals
How to Code-Along with Self-Paced Courses
How to Manage Clusters in Databricks
Introduction to Apache Spark Architecture
Introduction to Applied Linear Models
Introduction to Applied Statistics
Introduction to Applied Tree-Based Models
Introduction to Applied Unsupervised Learning
Introduction to AutoML
Introduction to Cloning with Delta Lake
Introduction to Databricks Connect
Introduction to Databricks Machine Learning
Introduction to Databricks Repos
Introduction to Delta Lake
Introduction to Delta Live Tables
Introduction to Feature Engineering and Selection with Databricks
Introduction to Hyperparameter Optimization
Introduction to Jobs
Introduction to MLflow Model Registry
Introduction to Multi-Task Jobs
Introduction to Natural Language Processing
Introduction to SQL on Databricks
Just Enough Python for Apache Spark
Just Enough Scala for Apache Spark
Lakehouse with Delta Lake Deep Dive
Machine Learning in Production: MLflow and Model Deployment
Migrating SAS Procedures to Databricks
Natural Language Processing at Scale with Databricks
Optimizing Apache Spark on Databricks
Propagating Changes with Delta Change Data Feed
Quick Reference: CI/CD
Quick Reference: Databricks Workspace User Interface
Quick Reference: Managing Databricks Notebooks with the Databricks Workspace
Quick Reference: Relational Entities on Databricks
Quick Reference: Spark Architecture
Scalable Deep Learning with TensorFlow and Apache Spark™
Scalable Machine Learning with Apache Spark
Setting up Databricks SQL
SQL Coding Challenges
Structured Streaming
Tracking Experiments with MLflow
What’s New in Apache Spark 3.0
Credential descriptions
Associate SQL Analyst Accreditation
Databricks Certified Associate Developer for Apache Spark 2.4
Databricks Certified Associate Developer for Apache Spark 3.0
Databricks Certified Associate ML Practitioner for Apache Spark 2.4
Databricks Certified Professional Data Scientist
Fundamentals of the Databricks Lakehouse Platform
Welcome to the Databricks Academy
About the Databricks Academy
Our mission at the Databricks Academy is to help our customers achieve their big
data and analytics goals through engaging learning experiences. At Databricks,
professionals from a wide variety of disciplines come together and use modern
pedagogical techniques to develop training that showcases Databricks best
practices. We offer our customers a wide range of materials to meet their diverse
training needs, whether they want to study at home, participate in a traditional
classroom setting, or engage with other Databricks users in public online courses,
so they can grow professionally with cloud-native skills.
About this course catalog
This course catalog is broken into the following categories:
● Welcome to the Databricks Academy: information about the Databricks
Academy and the students we serve
● What’s new this quarter: a list of the recently released training materials
● Databricks Academy offerings: an explanation of the types of learning
content we offer
● Course descriptions: short descriptions for each course available through
the Databricks Academy
What’s new this quarter
November 2021
ELT with Spark SQL
How to Ingest Data for Databricks SQL
Introduction to Databricks Data Science & Engineering Workspace
Scaling Machine Learning Pipelines
December 2021
Basic SQL on Databricks SQL
Getting Started with Databricks SQL
Introduction to MLflow Tracking
New Capability Overview: Feature Store
What’s being retired this quarter
November 9: Quick Reference: Databricks Workspace User Interface (content updated and
now included in Introduction to Databricks Data Science & Engineering Workspace)
November 9: Quick Reference: Managing Databricks Notebooks with the Databricks
Workspace (content updated and now included in Introduction to Databricks Data Science &
Engineering Workspace)
December 22: How to Manage Clusters in Databricks (content now included in Introduction
to Databricks Data Science & Engineering Workspace)
Databricks Academy offerings
Training
Self-paced online courses - asynchronous virtual training available to
individuals through the Databricks Academy website. This type of training is
free for Databricks customers. Each course is typically less than one hour in
length.
Workshops - live one- to three-hour training sessions offered to small
groups, typically in a virtual format. Please contact your Databricks
Customer Success Engineer for more information on workshops.
Instructor-led training - live one- to three-day training sessions available to
everyone (customers and the public) for a fee. These training offerings are
delivered virtually and on-site.
Credentials
Accreditations - low-stakes credentials earned by passing an unproctored
online exam administered through the Databricks Academy website. They
demonstrate mastery of technology areas at the introductory level and align
with self-paced training.
Certifications - higher-stakes credentials earned by passing a proctored exam
administered through a testing vendor. They demonstrate mastery of
intermediate and advanced technical areas and align with instructor-led
training. Unlike accreditations, which are prepared for a general audience,
certifications are role-based and designed to align with data practitioner
roles (for example, a data engineer or a data analyst role).
Learning paths
The learning paths included below are designed to help guide users to the courses
most relevant to them.
● Databricks Overview
● Platform administration
● Data analysis
● Data science / machine learning
● Data engineering
Course descriptions
Apache Spark Programming with
Databricks
Type: Instructor-led ($1,500 USD); Self-paced: Free for customers
Duration: 2 days (12 hours)
Course description: This course uses a case study-driven approach to explore the
fundamentals of Spark Programming with Databricks, including Spark architecture,
the DataFrame API, query optimization, and Structured Streaming. First, you will
become familiar with Databricks and Spark, recognize their major components, and
explore datasets for the case study using the Databricks environment. After
ingesting data from various file formats, you will process and analyze datasets by
applying a variety of DataFrame transformations, Column expressions, and built-in
functions. Lastly, you will execute streaming queries to process streaming data and
highlight the advantages of using Delta Lake.
Prerequisites:
● Familiarity with basic SQL concepts (select, filter, group by, join, etc.)
● Beginner programming experience with Python or Scala (syntax, conditions,
loops, functions)
Learning objectives:
● Identify core features of Spark and Databricks.
● Describe how DataFrames are created and evaluated in Spark.
● Apply DataFrame transformations to process and analyze data.
● Apply Structured Streaming to process streaming data.
● Explain fundamental Delta Lake concepts.
Course agenda:
● Day 1: DataFrames
○ Introduction
○ Databricks Platform
○ Spark SQL
○ Reader & Writer
○ DataFrame & Column
● Day 2: Transformations
○ Aggregation
○ Datetimes
○ Complex Types
○ Additional Functions
○ User-Defined Functions
● Day 3: Spark Optimization
○ Spark Architecture
○ Shuffles and Caching
○ Query Optimization
○ Spark UI
○ Partitioning
● Day 4: Structured Streaming
○ Review
○ Streaming Query
○ Processing Streams
○ Aggregating Streams
○ Delta Lake
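For readers who want a concrete feel for the DataFrame work described above, here is a minimal PySpark sketch (not part of the course materials). The dataset path and column names are illustrative placeholders, and it assumes a Databricks notebook where spark and display are predefined.

```python
from pyspark.sql import functions as F

# Hypothetical event data; the path and column names are placeholders.
events = spark.read.json("/databricks-datasets/example/events.json")

daily_revenue = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))  # built-in function
    .filter(F.col("event_type") == "purchase")                # Column expression
    .groupBy("event_date")
    .agg(
        F.sum("price").alias("revenue"),
        F.countDistinct("user_id").alias("buyers"),
    )
    .orderBy("event_date")
)

display(daily_revenue)  # Databricks notebook display
```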
Applications of SQL on Databricks
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In the course Introduction to SQL on Databricks, we introduce
Spark and Spark SQL as a solution for using common SQL syntax when working with
structured or semi-structured data. In this course, you will use Spark SQL on
Databricks to practice common design patterns for efficiently creating new tables
and explore built-in functions that can help you examine, manipulate, and aggregate
nested data.
Prerequisites & Requirements
● Prerequisites
○ Basic SQL commands
○ Experience working with SQL in a Databricks notebook
Learning objectives
● Use optional arguments in CREATE TABLE to define data format and location
in a Databricks database
● Efficiently copy, modify and create new tables from existing ones
● Use built-in functions and features of Spark SQL to manage and manipulate
nested objects
● Use roll-up, cube, and window functions to aggregate data and pivot tables
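As a rough illustration of these objectives, the sketch below (run here from a Python notebook cell, though the statements work equally well in SQL cells) creates a table with explicit format and location options and aggregates with ROLLUP. The table names, paths, and columns are hypothetical placeholders, not the course's datasets.

```python
# Hypothetical table, path, and column names; assumes a Databricks notebook.

# CREATE TABLE with optional arguments defining the data format and location.
spark.sql("""
  CREATE OR REPLACE TABLE sales_bronze
  USING DELTA
  LOCATION '/mnt/demo/sales_bronze'
  AS SELECT * FROM parquet.`/mnt/demo/raw/sales/`
""")

# ROLLUP produces per-region, per-country, and grand-total aggregates in one query.
display(spark.sql("""
  SELECT region, country, SUM(amount) AS total_amount
  FROM sales_bronze
  GROUP BY ROLLUP (region, country)
  ORDER BY region, country
"""))
```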
Applied Data Engineering Solutions with
Databricks
Type: Instructor-led ($1,500 USD); Self-paced: Free for customers
Duration: 2 days (12 hours)
Course description: In this course, formerly known as Advanced Data Engineering
with Databricks, participants will learn about advanced topics in building and
maintaining data engineering workloads on Databricks. This course highlights those
features of the Databricks Lakehouse platform that make it well-suited for
production data engineering, with an emphasis on Spark 3, Delta Lake, Structured
Streaming, and proprietary platform features.
Prerequisites:
● Advanced experience using Apache Spark
● Advanced experience coding with Python
● Intermediate experience writing SQL queries
● Intermediate experience using Databricks platform
● Intermediate experience using Delta Lake
● Intermediate experience using Structured Streaming
● Intermediate knowledge of data engineering concepts
Learning objectives:
● Design and implement multi-pipeline multi-hop architecture to enable the
Lakehouse paradigm.
● Deploy Structured Streaming operations that take advantage of Databricks
Job scheduling capabilities and avoid workspace limitations.
● Implement Databricks-native code that leverages platform-specific Delta
Lake features to simplify production workloads.
● Master design patterns that enable common use cases, including change data
capture (CDC), slowly changing dimensions (SCD), and managing personally
identifiable information (PII).
AWS Databricks Cloud Architecture and
System Integration Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: While the Databricks Unified Analytics Platform provides a
broad range of functionality to many members of data teams, it is through
integrations with other services that most cloud-native applications will achieve
results desired by customers. This course is a series of demos designed to help
students understand the portions of cloud workloads appropriate for Databricks.
Within these demos, we'll highlight integrations with first-party services in the AWS
cloud to build scalable and secure applications.
Prerequisites:
● Beginning knowledge of Spark programming (reading/writing data, batch and
streaming jobs, transformations and actions)
● Beginning-level experience using Python or Scala to perform basic control
flow operations.
● Familiarity with navigation and resource configuration in the AWS Console.
Learning objectives:
● Describe use cases for Databricks in an enterprise cloud architecture.
● Configure secure connections from Databricks to data in S3.
● Configure connections between Databricks and various first-party tools in an
enterprise cloud architecture, including Redshift and Kinesis.
● Deploy an MLflow model to a Sagemaker endpoint for serving online model
predictions.
● Configure Glue as an enterprise data catalog.
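As a small, hedged example related to the second objective, the snippet below reads data from S3 in two common ways: relying on an instance profile attached to the cluster, or pulling keys from a Databricks secret scope. The bucket, prefix, scope, and key names are placeholders.

```python
# Assumes a Databricks notebook; bucket, prefix, scope, and key names are placeholders.

# Option 1: the cluster's instance profile (IAM role) grants read access to the bucket.
df = spark.read.json("s3a://example-bucket/landing/events/")
df.printSchema()

# Option 2: supply credentials from a secret scope instead of hard-coding keys.
spark.conf.set("fs.s3a.access.key", dbutils.secrets.get(scope="aws", key="access-key"))
spark.conf.set("fs.s3a.secret.key", dbutils.secrets.get(scope="aws", key="secret-key"))
```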
AWS Databricks Cluster Usage
Management
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: In this course, you will first define computation resources
(clusters, jobs, and pools) and determine which resources to use for different
workloads. Then, you will learn cluster provisioning strategies for several use cases to
maximize usability and cost-effectiveness. You will also identify best practices for
cluster governance, including cluster policies. This course also covers capacity
limits, cost management, and chargeback analysis.
Prerequisites:
● Beginning experience using the Databricks workspace
● Beginning experience with Databricks administration
Learning objectives:
● Define computation resources (clusters, jobs, and pools) and determine
which resources to use for different workloads.
● Describe cluster provisioning strategies for several use cases to maximize
usability and cost effectiveness.
● Identify best practices for cluster governance, including cluster policies.
● Describe capacity limits on AWS Databricks.
● Describe how to manage costs and perform chargeback analysis.
AWS Databricks Data Access
Management
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will learn about the Databricks File System
and Hive Metastore concepts. Then, you will apply best practices to secure access
to Amazon S3 from Databricks. Next, you will configure access control for data
objects including tables, databases, views, and functions. You will also apply column
and row-level permissions and data masking with dynamic views for multiple users
and groups. Lastly, you will identify methods for data isolation within your
organization on Databricks.
Prerequisites:
● Beginning experience with AWS Databricks security, including deployment
architecture and encryptions
● Beginning experience with AWS Databricks administration, including identity
management and workspace access control
● Beginning experience using the Databricks workspace
● Databricks Premium Plan
Learning objectives:
● Describe fundamental concepts about the Databricks File System and Hive
Metastore.
● Apply best practices to secure access to Amazon S3 from Databricks.
● Configure access control for data objects including tables, databases, views,
and functions.
● Apply column and row-level permissions and data masking with dynamic
views for multiple users and groups.
● Identify methods for data isolation within your organization on Databricks.
AWS Databricks Identity Access
Management
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: In this course, you will learn how to manage user accounts and
groups in the Admin Console. You will also learn how to manage token-based
authentication and settings for your workspace, such as workspace storage and
additional cluster configurations. Lastly, this course covers access control for
workspace objects, such as notebooks and folders, in addition to clusters, pools, and
jobs.
Prerequisites:
● Experience using a web browser.
● Note: To perform the tasks shown in this course, you will need a Databricks
workspace deployment with administrator rights.
Learning objectives:
● Manage user accounts and groups in the Admin Console.
● Generate and manage personal access tokens for authentication.
● Enable additional cluster configurations and purge deleted objects from
workspace storage.
● Configure access control for workspace objects, such as notebooks and
folders.
● Configure access control for clusters, pools, and jobs.
AWS Databricks Security Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: This course provides an overview of essential security features
to consider when managing your AWS Databricks workspace. You will start by
identifying components of the Databricks platform architecture and deployment
models. Then, you will define several features regarding network security including
no public IPs, Bring Your Own VPC, VPC peering, and IP access lists. After recognizing
IdP integrations, you will explore access control configurations for different
workspace assets. You will then identify encryptions and permissions available for
data protection, such as IdP authentication, secrets, and table access control. Lastly,
you will describe security standards and configurations for compliance, including
cluster policies, Bring Your Own Key, and audit logs.
Prerequisites:
● Beginning-level knowledge of basic AWS cloud computing terms (e.g., S3, VPC,
IAM)
● Beginning-level knowledge of basic Databricks concepts (e.g., workspace,
clusters, notebooks)
Learning objectives:
● Describe components of the AWS Databricks platform architecture and
deployment model.
● Explain network security features including no public IP address, Bring Your
Own VPC, VPC peering, and IP access lists.
● Describe identity provider integrations and access control configurations for
an AWS Databricks workspace.
● Explain encryptions and permissions available for data protection, such as
identity provider authentication, secrets, and table access control.
● Describe security standards and configurations for compliance, including
cluster policies, Bring Your Own Key, and audit logs.
AWS Databricks SQL Administration
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will learn how to set up and configure access
to the Databricks SQL Analytics user interface. The administrative tasks in this
course will be done using the Databricks Workspace and Databricks SQL Analytics
UI, and will not include instruction for API access. By the end of this course, you will
be able to set up computational resources for users, grant and revoke access to
specific data, manage users and groups, and set up alert destinations.
Prerequisites:
● Intermediate knowledge of Databricks
● Databricks account on the Premium plan (with SQL Analytics enabled)
● Administrator credentials to your organization’s Databricks Workspace
Learning objectives:
● Describe how Databricks SQL Analytics is used by data practitioners.
● Manage user and group access to Databricks SQL Analytics.
● Configure and monitor SQL Endpoints to maximize performance, control
costs, and track usage on Databricks SQL Analytics.
● Set up access to data storage through SQL endpoints or external data stores
in order for users to access data on Databricks SQL Analytics.
● Control user access to data objects (e.g. tables, databases, and views) by
programmatically setting privileges for specific users and/or groups on
Databricks SQL Analytics.
● Create and configure Databricks SQL Analytics alert destinations for users.
AWS Databricks Workspace Deployment
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: This course will walk you through setting up your Databricks
account including setting up billing, configuring your AWS account, and adding users
with appropriate permissions. At the end of this course, you'll find guidance and
resources for additional setup options and best practices.
Prerequisites:
● Experience using a web browser.
● Note: To follow along with this course, you will need access to a Databricks
account with Account Owner permissions.
Learning objectives:
● Access the Databricks account console and set up billing.
● Configure an AWS account using a cross-account role or access keys.
● Configure AWS storage and deploy the Databricks workspace.
● Add users and assign admin or cluster creation rights.
● Identify resources for additional setup options and best practices.
Azure Databricks Cloud Architecture and
System Integration Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: While the Databricks Unified Analytics Platform provides a
broad range of functionality to many members of data teams, it is through
integrations with other services that most cloud-native applications will achieve
results desired by customers. This course is designed to help students understand
the portions of cloud workloads appropriate for Databricks, and highlight
integrations with first-party services in the Azure cloud to build scalable and secure
applications.
Prerequisites:
● Beginning knowledge of Spark programming (reading/writing data, batch and
streaming jobs, transformations and actions)
● Beginning-level experience using Python or Scala to perform basic control
flow operations.
● Familiarity with navigation and resource configuration in the Azure Portal.
Learning objectives:
● Describe use-cases for Azure Databricks in an enterprise cloud architecture.
● Configure secure connections to data in an Azure storage account.
● Configure connections from Databricks to various first-party tools, including
Synapse, Key Vault, Event Hubs, and CosmosDB.
● Configure Azure Data Factory to trigger production jobs on Databricks.
● Trigger CI/CD workloads on Databricks assets using Azure DevOps.
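For a concrete, if simplified, picture of the second objective, the sketch below configures access to an Azure Data Lake Storage account with an account key held in a secret scope and reads a Parquet dataset. The storage account, container, scope, and paths are hypothetical.

```python
# Assumes a Databricks notebook; all names below are placeholders.
storage_account = "examplestorage"
account_key = dbutils.secrets.get(scope="azure", key="storage-account-key")

# Register the account key with Spark so abfss:// paths can be resolved.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key,
)

df = spark.read.parquet(
    f"abfss://landing@{storage_account}.dfs.core.windows.net/events/"
)
display(df.limit(10))
```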
Azure Databricks Cluster Usage
Management
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will first define computation resources
(clusters, jobs, and pools) and determine which resources to use for different
workloads. Then, you will learn cluster provisioning strategies for several use cases to
maximize usability and cost effectiveness. You will also identify best practices for
cluster governance, including cluster policies. This course also covers capacity
limits, cost management, and chargeback analysis.
Prerequisites:
● Beginning experience with the Databricks workspace UI
● Beginning experience with Databricks administration
Learning objectives:
● Define computation resources (clusters, jobs, and pools) and determine
which resources to use for different workloads.
● Describe cluster provisioning strategies for several use cases to maximize
usability and cost effectiveness.
● Identify best practices for cluster governance, including cluster policies.
● Describe capacity limits on Azure Databricks.
● Describe how to manage costs and perform chargeback analysis.
Azure Databricks Data Access
Management
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will learn about the Databricks File System
and Hive Metastore concepts. Then, you will apply best practices to secure access
to Azure data storage from Azure Databricks. Next, you will configure access control
for data objects including tables, databases, views, and functions. You will also apply
column and row-level permissions and data masking with dynamic views for
multiple users and groups. Lastly, you will identify methods for data isolation within
your organization on Azure Databricks.
Prerequisites:
● Beginning experience with Azure Databricks security, including deployment
architecture and encryptions
● Beginning experience with Azure Databricks administration, including identity
management and workspace access control
● Beginning experience using the Azure Databricks workspace
● Azure Databricks Premium Plan
Learning objectives:
● Describe the Databricks File System and Hive Metastore concepts.
● Apply best practices to secure access to Azure data storage from Azure
Databricks.
● Configure access control for data objects including tables, databases, views,
and functions.
● Apply column and row-level permissions and data masking with dynamic
views for multiple users and groups.
● Identify methods for data isolation within your organization on Azure
Databricks.
Azure Databricks Identity Access
Management
Type: Self-paced; Cost: Free for customers
Duration: 45 minutes
Course description: In this course, you will learn how to manage user accounts and
groups in the Admin Console. You will also learn how to manage token-based
authentication and settings for your workspace, such as workspace storage and
additional cluster configurations. Lastly, this course covers access control for
workspace objects, such as notebooks and folders, in addition to clusters, pools, and
jobs.
Prerequisites:
● Beginning experience using the Databricks workspace.
Learning objectives:
● Manage user accounts and groups in the Admin Console.
● Generate and manage personal access tokens for authentication.
● Enable additional cluster configurations and purge deleted objects from
workspace storage.
● Configure access control for workspace objects, such as notebooks and
folders.
● Configure access control for clusters, pools, and jobs.
Azure Databricks Security Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: This course provides an overview of essential security features
to consider when managing your Azure Databricks workspace. You will start by
identifying components of the Azure Databricks platform architecture and
deployment model. Then, you will define several features regarding network security
including no public IPs, Bring Your Own VNET, VNET peering, and IP access lists. After
recognizing IdP and AAD integrations, you will explore access control configurations
for different workspace assets. You will then identify encryptions and permissions
available for data protection, such as IdP authentication, secrets, and table access
control. Lastly, you will describe security standards and configurations for
compliance, including cluster policies, Bring Your Own Key, and audit logs.
Prerequisites:
● Beginning-level knowledge of basic Azure cloud computing terms (e.g., Blob
storage, ADLS, VNET, Azure Active Directory)
● Beginning-level knowledge of basic Databricks concepts (e.g., workspace,
clusters, notebooks)
Learning objectives:
● Describe components of the Azure Databricks platform architecture and
deployment model.
● Explain network security features including no public IP address, Bring Your
Own VNET, VNET peering, and IP access lists.
● Describe identity provider and Azure Active Directory integrations and access
control configurations for an Azure Databricks workspace.
● Explain encryptions and permissions available for data protection, such as
identity provider authentication, secrets, and table access control.
● Describe security standards and configurations for compliance, including
cluster policies, Bring Your Own Key, and audit logs.
Azure Databricks SQL Administration
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will learn how to set up and configure access
to the Databricks SQL Analytics user interface. The administrative tasks in this
course will be done using the Databricks Workspace and Databricks SQL Analytics
UI, and will not include instruction for API access. By the end of this course, you will
be able to set up computational resources for users, grant and revoke access to
specific data, manage users and groups, and set up alert destinations.
Prerequisites:
● Intermediate knowledge of Databricks
● Databricks account on the Premium plan (with SQL Analytics enabled)
● Administrator credentials to your organization’s Databricks Workspace
Learning objectives:
● Describe how Databricks SQL Analytics is used by data practitioners.
● Manage user and group access to Databricks SQL Analytics.
● Configure and monitor SQL Endpoints to maximize performance, control
costs, and track usage on Databricks SQL Analytics.
● Set up access to data storage through SQL endpoints or external data stores
in order for users to access data on Databricks SQL Analytics.
● Control user access to data objects (e.g. tables, databases, and views) by
programmatically setting privileges for specific users and/or groups on
Databricks SQL Analytics.
● Create and configure Databricks SQL Analytics alert destinations for users.
Azure Databricks Workspace Deployment
Type: Self-paced; Cost: Free for customers
Duration: 10 minutes
Course description: In this course, you will identify the prerequisites for creating an
Azure Databricks workspace, deploy an Azure Databricks workspace in the Azure
portal, launch the workspace, and access the Admin Console.
Prerequisites:
● To complete the actions outlined in this course, you must have access to an
Azure subscription.
Learning objectives:
● Identify prerequisites for launching an Azure Databricks workspace.
● Deploy an Azure Databricks workspace in the Azure portal.
● Launch the deployed Azure Databricks workspace.
● Access the Admin Console in the deployed Azure Databricks workspace.
Certification Prep Course for the
Databricks Certified Associate Developer
for Apache Spark Exam
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: Prepare to take the Databricks Certified Associate Developer for
Apache Spark Exam. This course will cover the format and structure of the exam,
skills needed for the exam, tips for exam preparation, and the parts of the
DataFrame API and Spark architecture covered in the exam.
Prerequisites:
● Describe the basics of the Apache Spark architecture.
● Perform basic data transformations using the Apache Spark DataFrame API
using Python or Scala.
● Perform basic data input and output using the Apache Spark DataFrame API
using Python or Scala.
● Perform custom data actions using user-defined functions using Python or
Scala.
● Perform data transformations using Spark SQL.
● Note: While the above skills are not necessary for this course, the course will
be far more helpful in preparing students if they have these skills.
Learning objectives:
● Summarize the learning context behind the Databricks Certified Associate
Developer for Apache Spark exam.
● Describe the topics covered in the Databricks Certified Associate Developer
for Apache Spark exam.
● Describe the format and structure of the Databricks Certified Associate
Developer for Apache Spark exam.
● Apply practical test-taking strategies to answer example questions similar to
those of the Databricks Certified Associate Developer for Apache Spark
exam.
● Identify resources that can be used to learn the material covered in the
Databricks Certified Associate Developer for Apache Spark exam.
Configuring Workspace Access Control
Lists (ACLs)
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Databricks has extensive access control lists (ACLs) for
workspace assets to help administrators restrict and grant access to appropriate
users. This course includes a set of instructions and caveats for configuring many of
these settings, as well as a video walkthrough showing this configuration and the
resultant user experience.
Prerequisites:
● Basic knowledge of the Databricks workspace
Learning objectives:
● Manage permissions for groups of users.
● Control access to notebooks and folders.
● Restrict access for cluster creation and editing.
● Change ownership of configured jobs.
Data Engineering with Databricks
Type: Instructor-led ($1,000 USD); Self-paced: Free for customers
Duration: 2 days (12 hours)
Course description: This course begins with a review of programming with Spark
APIs and an introduction to key terms and definitions of Databricks data engineering
tools, followed by an overview of DB Connect, the Spark UI, and writing testable
code. Participants will learn about the Cloud Data Platform in terms of data
architecture concepts and will build an end-to-end OLAP data pipeline using Delta
Lake with batch and streaming data, learning best practices throughout. Participants
who wish to dive deeper into tuning and optimization can take the Advanced Data
Engineering with Databricks course.
Prerequisites:
● Intermediate to advanced programming skills in Python or Scala
● Intermediate to advanced SQL skills
● Beginning experience using the Spark DataFrames API
● Beginning knowledge of general data engineering concepts
● Beginning knowledge of the core features and use cases of Delta Lake
Learning objectives:
● Build an end-to-end batch and streaming OLAP data pipeline.
● Make data available for consumption by downstream stakeholders using
specified design patterns.
● Apply Databricks' recommended best practices in engineering a single source
of truth Delta architecture.
Course agenda:
● Note: There are approximately 13 hours of content in this course, divided into
four parts (Part 1, Part 2, Part 3, Part 4).
○ Two-day deliveries will go through Parts 1 and 2 on Day 1 and Parts 3
and 4 on Day 2.
○ Four-day deliveries will go through one part per day.
● Part 1: Course Welcome and Setup
○ Introduction
○ The Big Picture
○ Software Engineering
○ Configuration and Utilities
○ Planning your Data Pipeline
○ Engineering a Data Pipeline
● Part 2: Streaming Delta Tables
○ Ingesting Raw Data
○ Raw to Bronze
○ Delta Table Versioning
○ Bronze to Silver
○ Silver Update
○ The Query Layer
○ Silver to Gold
● Part 3: Batch Delta Tables
○ Planning your Data Pipeline
○ Configuration and Utilities
○ Ingest - Raw
○ Raw to Bronze
○ Bronze to Silver
○ Silver Update
○ Delta Architecture
○ Schema Enforcement and Evolution
● Part 4: Compliance and Optimization
○ GDPR & CCPA Compliance
○ Normalization
○ SCD and CDC
○ Delta Engine Optimizations
Data Science on Databricks Rapid Start
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: This course will provide an overview of the features and
functionality within the Unified Data Analytics Platform that enable data
practitioners to follow data science and machine learning workflows. Aside from an
overview of features and functionality, the course will provide learners with
hands-on experience using the Unified Data Analytics Platform to execute basic
tasks and solve a real-world problem.
Prerequisites:
● Beginning experience with Python as applied to data science and analysis.
● Beginning experience with popular data science tools such as pandas and
charting libraries
● Beginning experience working with notebooks (not necessarily Databricks
notebooks)
Learning objectives:
● Summarize Databricks functionality that enables data practitioners to work
with data through the data science workflow.
● Summarize Databricks functionality that enables data practitioners to run
machine learning experiments on data.
● Solve a given real-world problem by executing basic data science tasks in the
Unified Data Analytics Platform.
Data Science on Databricks -
Bias-Variance Tradeoff
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, we’ll show you how to use scikit-learn on
Databricks, along with some core statistical and data science principles, to select a
family of machine learning models for deployment.
This course is the first in a series of three courses developed to show you how to
use Databricks to work with a single data set from experimentation to
production-scale machine learning model deployment. The other courses in this
series include:
● Tracking Experiments with MLflow
● Deploying a Machine Learning Project with MLflow Projects
Prerequisites:
● Beginning-level experience running data science workflows in the Databricks
Workspace
● Beginner-level experience with Apache Spark
● Intermediate-level experience with the SciPy numerical stack
Learning objectives:
● Create and explore an aggregate sample from user event data.
● Design an MLflow experiment to estimate model bias and variance.
● Use exploratory data analysis and estimated model bias and variance to
select a family of models for model development.
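The following sketch, using synthetic data rather than the course's user-event sample, shows one way an MLflow experiment can compare train and validation error across model complexities, which is the kind of evidence used to reason about bias and variance. The model family, parameter values, and names are illustrative assumptions only.

```python
import mlflow
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; purely illustrative, not the course dataset.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Sweep model complexity; a widening gap between train and validation error
# suggests the model family is moving from high bias toward high variance.
for depth in [1, 3, 5, 10, 20]:
    with mlflow.start_run(run_name=f"tree_depth_{depth}"):
        model = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
        mlflow.log_param("max_depth", depth)
        mlflow.log_metric("train_mse", mean_squared_error(y_train, model.predict(X_train)))
        mlflow.log_metric("val_mse", mean_squared_error(y_val, model.predict(X_val)))
```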
Data Visualization with Databricks SQL
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will learn how to use Databricks SQL, an
integrated SQL editing and dashboarding tool, from your Databricks workspace.
Working with Databricks SQL allows you to easily query your data lake, or other data
sources, and build dashboards that can be easily shared across your organization.
You will learn how to parameterize queries so that users can easily modify
dashboard views to target specific results. Also, we will make use of alerts for
ongoing monitoring so that you can be notified when certain events occur or when
particular attributes of a data set reach a certain threshold.
Prerequisites:
● Access to the Databricks SQL interface
● Intermediate experience using the Databricks platform
● Intermediate experience with SQL
● Intermediate experience with data analysis concepts
Learning objectives:
● Describe how you can use SQL from your Databricks workspace.
● Execute queries and create visualizations using Databricks SQL.
● Write parameterized queries so that users can easily customize their results
and visualizations.
● Create and share dashboards that hold a collection of visualizations.
Databases, Tables, and Views on
Databricks
Type: Self-paced; Cost: Free for customers
Duration: 35 minutes
Course description: In this short course, you’ll learn how to create databases, tables,
and views on Databricks. Special attention will be given to differences in scope and
persistence for these various entities, allowing any user responsible for creating or
managing databases, tables, or views to make informed decisions for their
organization. While the syntax for creating and working with databases, tables, and
views will be familiar to most SQL users, some default behaviors may surprise users
new to Databricks.
Prerequisites:
● Beginning knowledge of SQL
● Beginning knowledge of loading and interacting with sample data from
Databricks.
● Beginning knowledge of using Databricks notebooks
Learning objectives:
● Describe persistence and scope of databases, tables, and views on
Databricks.
● Compare and contrast the behavior of managed and unmanaged tables.
● Summarize best practices for creating and managing databases, tables, and
views on Databricks.
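The sketch below hints at the scope and persistence differences the course explores: a managed table whose files Databricks controls, an external (unmanaged) table registered against an explicit location, and a session-scoped temporary view. The database, table, and path names are hypothetical.

```python
# Assumes a Databricks notebook; all names and paths are placeholders.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")

# Managed table: Databricks manages the files; dropping the table deletes the data.
spark.sql("""
  CREATE OR REPLACE TABLE demo_db.managed_events
  USING DELTA
  AS SELECT * FROM parquet.`/databricks-datasets/example/events/`
""")

# External (unmanaged) table: data lives at the given location; dropping the
# table removes only the metastore entry, not the underlying files.
spark.sql("""
  CREATE OR REPLACE TABLE demo_db.external_events
  USING DELTA
  LOCATION '/mnt/demo/external_events'
  AS SELECT * FROM demo_db.managed_events
""")

# Temporary view: scoped to the current Spark session only.
spark.table("demo_db.managed_events").createOrReplaceTempView("events_vw")
```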
Databricks Command Line Interface (CLI)
Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 45 minutes
Course description: While the Databricks platform web-based graphical user
interface provides powerful functionality for data teams, many use cases call for
programmatic command line access. The Databricks command line interface (CLI)
provides access to a variety of powerful workspace features. This module is not
intended as a comprehensive overview of all the CLI can do, but rather an
introduction to some of the common features users may desire to leverage in their
workloads.
Prerequisites:
● Familiarity with Apache Spark concepts
● Familiarity with the data engineering capabilities of the Databricks Platform
● Intermediate experience using the Databricks platform for data engineering
(creating clusters, loading notebooks, scheduling jobs, etc.)
Learning objectives:
● Install and configure the Databricks CLI to securely interact with the
Databricks Workspace.
● Configure workspace secrets using the CLI for more secure sharing and use of
string-based credentials in notebooks.
● Sync notebooks and libraries between the Databricks workspace and other
environments using the CLI.
● Perform a variety of tasks including interacting with clusters, jobs, and runs
using the CLI.
Databricks Datadog Integration
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Datadog provides customizable integration scripts and
dashboards to integrate your Databricks logs into your larger monitoring ecosystem.
This lesson goes through basic configuration, as well as extending this configuration
to add additional security and custom tagging.
Prerequisites:
● Basic familiarity with the Databricks workspace
● Basic familiarity with cluster configuration
Learning objectives:
● Configure both ends of the Databricks Datadog integration.
● Add custom variables to your monitored clusters.
● Use Databricks secrets to redact API tokens.
Databricks on Google Cloud: Architecture
and Security Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: This course dives into the platform architecture and key security
features of Databricks on Google Cloud. You will start with an overview of Databricks
on Google Cloud and how it integrates with the Google Cloud ecosystem. Then, you
will define core components of the platform architecture and deployment model on
Databricks on Google Cloud. You will also learn about key security features to
consider when provisioning and managing workspaces, as well as guidelines on
network security, identity and access management, and data protection.
Prerequisites & Requirements
● Prerequisites
○ Basic familiarity with Databricks concepts (workspace, notebooks,
clusters, DBFS, etc)
○ Basic familiarity with Google Cloud concepts (projects, IAM, GCS, VPC,
subnets, GKE, etc)
Learning objectives
● Describe how Databricks integrates with the Google Cloud ecosystem.
● Identify components of the Databricks on Google Cloud platform architecture
and deployment model.
● Recognize best practices for network security when deploying workspaces.
● Describe identity management and access control features in Databricks on
Google Cloud.
● Identify storage locations and data protection features in Databricks on
Google Cloud.
Databricks on Google Cloud: Cloud
Architecture and System Integration
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: This course is a series of demos designed to help students
understand the portions of cloud workloads appropriate for Databricks. Within these
demos, we'll highlight integrations with first-party services in Google Cloud to build
scalable and secure applications.
Prerequisites & Requirements
● Prerequisites
○ Familiarity with the Databricks on Google Cloud workspace
○ Beginning knowledge of Spark programming (reading/writing data,
batch and streaming jobs, transformations and actions)
○ Beginning-level experience using Python or Scala to perform basic
control flow operations.
○ Familiarity with navigation and resource configuration in the Databricks
on Google Cloud Console.
Learning objectives
● Describe where Databricks fits into a cloud-based architecture on Google
Cloud.
● Authenticate to Google Cloud resources with service account credentials.
● Read and write data to Cloud Storage using Databricks secrets.
● Mount a GCS bucket to DBFS using cluster-wide service accounts.
● Configure a cluster to read and write data to BigQuery using credentials in
DBFS.
Databricks on Google Cloud: Cluster
Usage Management
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course covers essential cluster configuration features and
provisioning guidelines for Databricks on Google Cloud. In this course, you will start
by defining core computation resources (clusters, jobs, and pools) and determine
which resources to use for different workloads. Then, you will learn cluster
provisioning strategies for several use cases to maximize manageability. Lastly, you
will learn how to manage cluster usage and cost for your Databricks on Google Cloud
workspaces.
Prerequisites & Requirements
● Prerequisites
○ Beginning experience using the Databricks workspace
○ Beginning experience with Databricks administration
Learning objectives
● Describe the core computation resources in Databricks: clusters, jobs, and
pools.
● Recognize best practices for configuring cluster resources for different
workloads.
● Identify cluster provisioning use cases and strategies for manageability.
● Describe how to manage cluster usage and cost for Databricks on Google
Cloud.
Databricks on Google Cloud: Workspace
Deployment
Type: Self-paced; Cost: Free for customers
Duration: 20 minutes
Course description: This is a short course that shows new customers how to set up
a Databricks account and deploy a workspace on Google Cloud. This will cover
accessing the Account Console and adding account admins, provisioning and
accessing workspaces, and adding users and admins to a workspace.
Prerequisites & Requirements
● Prerequisites
○ Basic familiarity with Databricks concepts (Databricks account,
workspace, DBFS, etc)
○ Basic familiarity with Google Cloud concepts (Cloud console, project,
GCS, IAM, VPC, etc)
Learning objectives
● Access the Databricks Account Console.
● Add account admins in the Account Console.
● Provision and access a Databricks workspace.
● Access the Admin Console for a Databricks workspace.
● Add workspace users and admins in the Admin Console.
Databricks with R
Type: Self-paced; Cost: Free for customers
Duration: 7 hours
Course description: In this seven-hour course, you will analyze clickstream data from
an imaginary mattress retailer called Bedbricks. In this case study, you'll explore the
fundamentals of Spark Programming with R on Databricks, including Spark
architecture, the DataFrame API, and Machine Learning.
Prerequisites & Requirements
● Prerequisites
○ Beginning experience working with R.
Learning objectives
● Identify core features of Spark and Databricks.
● Describe how DataFrames are created and evaluated in Spark.
● Apply the DataFrame transformation API to process and analyze data.
Databricks Workspace Fundamentals for
Business Analytics
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: This course will showcase Databricks Workspace (Workspace)
functionality useful to business/SQL analysts. It will begin with a review of
vocabulary terms helpful for working in the Workspace and include simple workflows
that data practitioners can use to quickly query and analyze data.
Prerequisites:
● Beginning experience with SQL is helpful (examples will be done using SQL)
● Beginning knowledge about what Databricks is and what it does
Learning objectives:
● Summarize fundamental concepts for using Databricks effectively.
● Explain how data professionals can use Databricks to extract meaning from
data.
● Use Databricks to follow a traditional data workflow to extract meaning from
data.
Delta Lake Rapid Start with Python
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: Apache Spark™ is the dominant processing framework for big
data. Delta Lake is a robust storage solution designed specifically to work with
Apache Spark™. It adds reliability to Spark so your analytics and machine learning
initiatives have ready access to quality, reliable data. Delta Lake makes data lakes
easier to work with and more robust. It is designed to address many of the problems
commonly found with data lakes. This course covers the basics of working with
Delta Lake, specifically with Python, on Databricks.
Prerequisites:
● Beginning level experience using Databricks to upload and visualize data
● Intermediate level experience using Apache Spark including the CTAS pattern
and use of popular pyspark.sql functions
● Beginning level knowledge of Delta Lake
Learning objectives:
● Use Delta Lake to create a new Delta table.
● Convert an existing Parquet-based data lake table into a Delta table.
● Differentiate between a batch update and an upsert to a Delta table.
● Use Delta Lake Time Travel to view different versions of a Delta table.
● Execute a MERGE command to upsert data into a Delta table.
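To make the last objective concrete, here is a minimal sketch of an upsert using the Delta Lake Python API; the paths, table layout, and join key are hypothetical and not taken from the course.

```python
from delta.tables import DeltaTable

# Hypothetical paths and columns; assumes a Databricks notebook.
updates_df = spark.read.json("/mnt/demo/customer_updates/")
target = DeltaTable.forPath(spark, "/mnt/demo/delta/customers")

# MERGE (upsert): update rows that match on the key, insert the rest.
(target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```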
Delta Lake Rapid Start with Spark SQL
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: Apache Spark™ is the dominant processing framework for big
data. Delta Lake is a robust storage solution designed specifically to work with
Apache Spark™. It adds reliability to Spark so your analytics and machine learning
initiatives have ready access to quality, reliable data. Delta Lake makes data lakes
easier to work with and more robust. It is designed to address many of the problems
commonly found with data lakes. This course covers the basics of working with
Delta Lake, specifically with Spark SQL, on Databricks.
Prerequisites:
● How to upload data into a Databricks Workspace
● How to visualize data using Databricks
● Intermediate level Spark SQL usage including the CTAS pattern, use of Spark
SQL functions such as from_unixtime, lag, lead, and partitioning.
Learning objectives:
● Use Delta Lake to create a new Delta table and to convert an existing
Parquet-based data lake table
● Differentiate between a batch append and an upsert to a Delta table
● Use Delta Lake Time Travel to view different versions of a Delta table
● Execute a MERGE command to upsert data into a Delta table
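As a brief illustration of Time Travel, the statements below (shown through spark.sql, though they can be run directly in SQL cells) inspect a table's history and query earlier versions; the table name and version/timestamp values are placeholders.

```python
# Hypothetical table name; assumes a Databricks notebook and an existing Delta table.

# Inspect the table's commit history (versions, timestamps, operations).
display(spark.sql("DESCRIBE HISTORY customers"))

# Time Travel: read an earlier version of the table by version number...
previous = spark.sql("SELECT * FROM customers VERSION AS OF 1")

# ...or by timestamp.
as_of_ts = spark.sql("SELECT * FROM customers TIMESTAMP AS OF '2021-11-01'")
```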
Deploying a Machine Learning Project with
MLflow Projects
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: In this course, we’ll show you how to train and deploy a large
scale machine learning model using MLflow and Apache Spark. This course is the
third in a series of three courses developed to show you how to use Databricks to
work with a single data set from experimentation to production-scale machine
learning model deployment. We recommend taking the first two courses in this
series before continuing with this course:
● Data Science on Databricks - Bias-Variance Tradeoff
● Tracking Experiments with MLflow
Prerequisites:
● Beginning-level experience running data science workflows in the Databricks
Workspace
● Beginner-level experience with Apache Spark
● Intermediate-level experience with the SciPy numerical stack
● Intermediate-level experience with the command line
Learning objectives:
● Summarize Databricks best practices for deploying machine learning projects
with MLflow.
● Explain local development strategies for writing software with Databricks.
● Use Databricks to write production-grade machine learning software.
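For orientation, the snippet below shows roughly how an MLflow Project run can be launched from Python; the project directory, entry point, parameter, and experiment path are hypothetical placeholders rather than what the course actually uses.

```python
import mlflow

# Assumes the current directory (or a Git URI) contains an MLproject file that
# defines a "main" entry point accepting an "alpha" parameter; all names are placeholders.
submitted = mlflow.run(
    uri=".",
    entry_point="main",
    parameters={"alpha": 0.5},
    experiment_name="/Shared/demo-project",
)
print(submitted.run_id, submitted.get_status())
```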
ELT with Spark SQL
Type: Self-paced; Cost: Free for customers
Duration: 2.5 hours
Course description: This course teaches experienced SQL analysts and engineers
how to complete common ELT tasks using Spark SQL on Databricks. Students will
extract data from multiple data sources, load data into Delta Lake tables, and apply
data quality checks and transformations. Students will also learn how to leverage
existing tables in a Lakehouse for last-mile ETL to support dashboards and
reporting.
Prerequisites:
● Students should be able to navigate the Databricks workspace (creating and
loading notebooks, connecting to clusters)
● Students should have intermediate fluency in SQL
● Students should be familiar with relational entities on Databricks
● Students should be familiar with the high-level architecture of the Lakehouse
Learning objectives:
● Extract data from a variety of common data sources using Spark SQL in the
Databricks Data Science and Engineering workspace
● Load data into Delta Lake tables using the Databricks Data Science and
Engineering workspace
● Apply transformations to complete common cleaning tasks and data quality
checks using the Databricks Data Science and Engineering workspace
● Reshape datasets with advanced functions to derive analytical insights using
the Databricks Data Science and Engineering workspace
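A compressed, hypothetical sketch of the extract-and-load pattern this course teaches appears below: querying raw files in place, loading them into a Delta Lake table, and applying a simple quality filter on the way to a cleaned table. Paths, table names, and columns are placeholders, and the same statements can be run in SQL cells.

```python
# Assumes a Databricks notebook; paths, tables, and columns are placeholders.

# Extract and load: query JSON files in place, then materialize a bronze Delta table.
spark.sql("""
  CREATE OR REPLACE TABLE orders_bronze
  USING DELTA
  AS SELECT * FROM json.`/mnt/demo/raw/orders/`
""")

# Transform with a basic quality check: keep only well-formed, non-negative orders.
spark.sql("""
  CREATE OR REPLACE TABLE orders_silver
  USING DELTA
  AS SELECT order_id,
            customer_id,
            CAST(order_ts AS TIMESTAMP) AS order_ts,
            amount
     FROM orders_bronze
     WHERE order_id IS NOT NULL AND amount >= 0
""")
```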
Easy ETL with Auto Loader
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Databricks Auto Loader is the preferred method for ingesting
incremental data landing in cloud object storage into your Lakehouse. This course
introduces Auto Loader and demonstrates some of the newer features added to this
product. Included are recommended patterns for data ingestion with Auto Loader.
Prerequisites:
● Basic experience with Spark APIs
● Basic knowledge of Delta Lake
● Basic experience with Structured Streaming
Learning objectives:
● Describe the basic functionality and features of Auto Loader.
● Use Auto Loader to ingest data to Delta Lake without losing data.
● Configure automatic schema detection and evolution.
● Rescue unexpected data arriving in well-structured datasets.
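The sketch below gives a rough sense of what an Auto Loader ingestion stream looks like; the paths, the options shown, and the choice of a run-once trigger are illustrative assumptions rather than the course's exact recommendations.

```python
# Assumes a Databricks notebook; all paths are placeholders.
raw_path = "/mnt/demo/landing/events/"
schema_path = "/mnt/demo/schemas/events/"         # where inferred schemas are tracked
checkpoint_path = "/mnt/demo/checkpoints/events/"

# Auto Loader uses the "cloudFiles" streaming source to pick up new files incrementally.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)
    .load(raw_path)
)

(stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true")   # let the Delta sink evolve with new columns
    .trigger(once=True)              # process currently available files, then stop
    .start("/mnt/demo/bronze/events/"))
```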
Enterprise Architecture with Databricks
Type: Self-paced; Cost: Free for customers
Duration: 7 hours
Course description: In this course, you’ll learn how business leaders, admins, and
architects use Databricks in their architecture. We’ll cover fundamental concepts
about key roles (data engineers, data scientists, and platform administrators) and
raw data forms (structured and unstructured data, batch and streaming data) to set
the stage for a discussion of how end users help businesses create data assets like
machine learning models, reports, and dashboards. Then, we’ll discuss where
components of Azure Databricks fit into an organization’s big data ecosystem.
Finally, we’ll review real-world business use cases and create enterprise-level
architecture and infrastructure diagrams.
Prerequisites:
● Beginning knowledge of the characteristics that define big data (the three Vs:
velocity, volume, and variety)
● Beginning knowledge of how organizations process and manage big data
(relational/SQL vs. NoSQL, cloud vs. on-premises, open-source databases vs.
closed-source database-as-a-service offerings)
● Beginning knowledge about the roles that data practitioners play on data
science teams (can distinguish between database administrators and data
scientists, data analysts and machine learning engineers, data engineers and
platform administrators)
Learning objectives:
● Create a requirements document which profiles the data needs of an
organization.
● Translate business needs related to data analytics into technical
requirements used for drawing an architectural diagram.
● Translate the Databricks Lakehouse Architecture with Delta to a technical
requirements document.
● Design Azure Databricks architectures that include integration with Azure
services for real-world scenarios.
● Evaluate, analyze, and validate detailed infrastructure designs.
● Create infrastructure designs.
Fundamentals of Big Data
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: This course was created for individuals who are new to the big
data landscape and want to become conversant with big data terminology. It will
cover foundational concepts related to the big data landscape including:
characteristics of big data; the relationship between big data, artificial intelligence,
and data science; how individuals on data science teams work with big data; and
how organizations can use big data to enable better business decisions.
Prerequisites:
● Experience using a web browser
Learning objectives:
● Explain foundational concepts used to define big data.
● Explain how the characteristics of big data have changed traditional
organizational workflows for working with data.
● Summarize how individuals on data science teams work with big data on a
daily basis to drive business outcomes.
● Articulate examples of real-world use-cases for big data in businesses across
a variety of industries.
Fundamentals of Cloud Computing
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This introductory-level course is designed to familiarize
individuals new to the cloud computing landscape. It will cover foundational
concepts related to cloud computing starting with the basics - what cloud
computing is and why, since 2011, over 30% of organizations have moved their
operations to the cloud. The course will also cover topics like cloud delivery models
and deployment types.
Please note that this course is about cloud computing in general and does not
focus on Databricks specifically.
Prerequisites:
● Experience using a web browser
Learning objectives:
● Summarize foundational concepts about cloud computing.
● Describe major cloud computing components.
● Explain the three major cloud computing delivery models.
● Explain the three major cloud computing deployment models.
● Outline the benefits of moving an organization’s operations to the cloud.
Fundamentals of Databricks Machine
Learning
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Databricks Machine Learning offers data scientists and other
machine learning practitioners a platform for completing and managing the
end-to-end machine learning lifecycle. This course guides business leaders and
practitioners through a basic overview of Databricks Machine Learning, the benefits
of using Databricks Machine Learning, its fundamental components and
functionalities, and examples of successful customer use.
Prerequisites:
● Beginning-level knowledge of the Databricks Lakehouse platform
Learning objectives:
● Give a basic overview of Databricks Machine Learning.
● Identify how using Databricks Machine Learning benefits data science and
machine learning teams.
● Summarize the fundamental components and functionalities of Databricks
Machine Learning.
● Give examples of successful Databricks Machine Learning use cases from real
Databricks customers.
Fundamentals of Databricks SQL
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Databricks SQL offers SQL users a platform for querying,
analyzing, and visualizing data in their organization’s Lakehouse. This course explains
how Databricks SQL processes queries and guides users through how to use the
interface. Then, this course will explain how you can connect Databricks SQL to
your favorite business intelligence tool, so that you can query your Lakehouse
without making changes to your analytical and dashboarding workflows.
Prerequisites:
● None.
Learning objectives:
● Summarize fundamental concepts for using Databricks SQL effectively.
● Identify tools and features in Databricks SQL for querying and analyzing data
as well as sharing insights with the larger organization.
● Explain how Databricks SQL supports data analysis workflows that allow users
to extract and share business insights.
Fundamentals of Delta Lake
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Delta Lake is an open format storage layer that sits on top of
your organization’s data lake. It is the foundation of a cost-effective, highly scalable
Lakehouse and is an integral part of the Databricks Lakehouse Platform.
In this course, we’ll break down the basics behind Delta Lake - what it does, how it
works, and why it is valuable from a business perspective, to any organization with
big data and AI projects.
Note: This is an introductory-level course that will *not* showcase in-depth
technical Delta Lake demos nor provide hands-on technical training with Delta Lake.
Please see the Delta Lake Rapid Start courses available in the Databricks Academy
for technical training on Delta Lake.
Prerequisites:
● Beginning knowledge of the Databricks Lakehouse Platform. We recommend
taking the course Fundamentals of the Databricks Lakehouse Platform prior to
taking this course.
Learning objectives:
● Describe how Delta Lake fits into the Databricks Lakehouse Platform.
● Explain the four elements encompassed by Delta Lake.
● Summarize high-level Delta Lake functionality that helps organizations solve
common challenges related to enterprise-scale data analytics.
● Articulate examples of how organizations have employed Delta Lake on
Databricks to improve business outcomes.
Fundamentals of Enterprise Data
Management Systems
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Whether your organization is moving to the cloud for the first
time or reevaluating its current approach, making decisions about the technology
used when storing your data can have huge implications for costs and performance
in downstream analytics. As a platform focused on computation and analytics,
Databricks seeks to help our customers make choices that unlock new
opportunities, reduce redundancies, and connect data teams. In this course, you’ll
start by exploring the characteristics of data lakes and data warehouses, two
popular data storage technologies. Then, you’ll learn about the Lakehouse, a new
data storage system invented and made popular by Databricks.
Prerequisites:
● Beginning knowledge about the Databricks Unified Data Analytics Platform.
● We recommend taking the courses: Fundamentals of Big Data and
Fundamentals of Unified Data Analytics with Databricks prior to taking this
course.
Learning objectives:
● Describe the strengths and limitations of data lakes, related to data storage.
● Describe the strengths and limitations of data warehouses, related to data
storage.
● Contrast data lake and data warehouse characteristics.
● Compare the features of a Lakehouse to the features of popular data storage
management solutions.
Fundamentals of Lakehouse Architecture
Type: Self-paced; Cost: Free for customers
Duration: 10 minutes
Course description: In this ten-minute video, you will learn about the Lakehouse, a
new data management architectural pattern that offers state-of-the-art support
and performance for data science, machine learning, and business analytics
applications. Furthermore, we will explore how Delta Lake plays an integral part in
creating a Lakehouse.
Prerequisites:
● None
Learning objectives:
● Identify analytics as a key component for data-driven decision making.
● Describe differences between common types of data storage solutions.
● Define the term "lakehouse" with respect to data management systems.
● Cite types of optimizations you can use with Delta Engine.
Fundamentals of Machine Learning
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course you’ll learn fundamental concepts about machine
learning. First, we’ll review machine learning basics - what it is, why it’s used, and
how it relates to data science. Then, we’ll explore the two primary categories that
machine learning problems are categorized into - supervised and unsupervised
learning. Finally, we’ll review how the machine learning workflow fits into the data
science process.
Prerequisites & Requirements
● Prerequisites
○ Beginning knowledge about concepts related to the big data landscape
helpful but not required (i.e. big data types, analysis techniques,
processing techniques, etc.)
○ We recommend taking the Databricks Academy course "Introduction to
Big Data" before taking this course.
Learning objectives
● Explain how machine learning is used as an analysis tool in data science.
● Summarize the relationship between the data science process and the
machine learning workflow.
● Describe the two primary categories that machine learning problems are
categorized into.
● Describe popular machine learning techniques within the two primary
categories of machine learning.
● Determine the machine learning technique that should be used to analyze
data in a given real-world scenario.
Fundamentals of Structured Streaming
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: A common struggle that organizations face is how to accurately
ingest and perform calculations on real-time data. This data is also referred to as
streaming data, and the challenges behind working with it lie in its real-time nature -
because it is constantly arriving, mechanisms must be put into place to process it
and write it to a data store. In this course, you’ll learn about Structured Streaming, an
Apache Spark API that helps data practitioners overcome the challenges of working
with streaming data. We’ll cover fundamental concepts about batch and streaming
data to help set the stage for our discussion on Structured Streaming. Then, we’ll
discuss where Structured Streaming fits into an organization’s big data ecosystem.
Finally, we’ll review real-world Structured Streaming business use cases.
Prerequisites & Requirements
● Prerequisites
○ Beginning knowledge about the Databricks Unified Data Analytics
Platform (what it is, what it is used for)
○ Beginning knowledge about concepts related to the big data landscape
(for example: structured streaming, batch processing, data pipelines)
○ Note: We recommend taking the following two Databricks Academy
courses to help you prepare for this course: Fundamentals of Big Data
and Fundamentals of Unified Data Analytics with Databricks.
Learning objectives
● Explain the benefits of Structured Streaming for working with streaming data.
● Distinguish where Structured Streaming fits into an organization’s big data
ecosystem.
● Articulate examples of real-world business use cases for Structured
Streaming.
Fundamentals of the Databricks
Lakehouse Platform
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course is designed for everyone who is brand new to the
Platform and wants to learn more about what it is, why it was developed, what it
does, and the components that make it up.
Our goal is that by the time you finish this course, you’ll have a better understanding
of the Platform in general and be able to answer questions like: What is Databricks?
Where does Databricks fit into my workflow? How have other customers been
successful with Databricks?
NOTE: THIS COURSE DOES NOT CONTAIN HANDS-ON PRACTICE WITH THE
DATABRICKS LAKEHOUSE PLATFORM. IT WAS DEVELOPED AS A PREREQUISITE TO
OTHER COURSES WHICH DO HAVE HANDS-ON COMPONENTS.
Prerequisites:
● Experience using a web browser.
Learning objectives:
● Describe what the Databricks Lakehouse Platform is.
● Explain the origins of the Lakehouse data management paradigm.
● Outline fundamental problems that cause most enterprises to struggle with
managing and making use of their data.
● Identify the most popular components of the Databricks Lakehouse Platform
used by data practitioners, depending on their unique role.
● Give examples of organizations that have used the Databricks Lakehouse
Platform to streamline big data processing and analytics.
● Describe security features that come built-in to the Databricks Lakehouse
Platform.
Google Cloud Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: Learn the basics of Google Cloud and how to configure various
resources using the Cloud Console. This course begins with an overview of the
platform, key terminology, and core services. You will then learn essential IAM
concepts and how service accounts can be used to manage resources. You will also
learn about the function and use cases of several storage services, such as Cloud
Storage, Cloud SQL, and BigQuery. This course also covers virtual machine and
networking concepts in Compute Engine and VPC services. The course ends with an
overview of GKE clusters and Kubernetes concepts.
Prerequisites:
● Familiarity with basic cloud computing concepts (cloud computing, cloud
storage, virtual machine, database, data warehouse)
Learning objectives:
● Define basic concepts and core services in the Google Cloud Platform.
● Describe IAM concepts and how service accounts can be used to manage
resources.
● Identify use cases for storage services, such as Cloud Storage, Cloud SQL,
and BigQuery.
● Define virtual machine and networking concepts in Compute Engine and VPC
services.
● Describe Google Kubernetes Engine and the core components of Kubernetes
clusters.
How to Code-Along with Self-Paced
Courses
Type: Self-paced; Cost: Free for customers
Duration: 3 minutes
Course description: Some Databricks Academy courses are designed with hands-on
coding exercises that you can follow along with. This short video guides you
through navigating these courses and demonstrates the recommended workflow for
completing the coding exercises.
Prerequisites:
● Beginning knowledge about the Databricks Collaborative Data Science
Workspace environment
● Beginning knowledge about uploading data to the Collaborative Data Science
Workspace
Learning objectives:
● Describe how to complete coding exercises in code-based self-paced
courses.
How to Manage Clusters in Databricks
Type: Self-paced; Cost: Free for customers
Duration: 15 minutes
Course description: In this course, you’ll learn a series of skills for working with and
configuring clusters in the Databricks Collaborative Data Science Workspace
(Workspace) including exploring cluster functions and creating, displaying, cloning,
editing, pinning, terminating, and deleting a cluster.
Prerequisites:
● Beginning knowledge about the Databricks Collaborative Data Science
Workspace environment
Learning objectives:
● Use the Databricks Workspace to perform a variety of cluster management
tasks.
Introduction to Apache Spark
Architecture
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will explore how Apache Spark executes a
series of queries. Examples will include simple narrow transformations and more
complex wide transformations.
This course will give developers a working understanding of how to write code that
leverages the power of Apache Spark for even the simplest of queries.
Prerequisites:
● Familiarity with basic information about Apache Spark (what it is, what it is
used for)
Learning objectives:
● Explain how Apache Spark applications are divided into jobs, stages, and
tasks.
● Explain the major components of Apache Spark's distributed architecture.
Introduction to Applied Linear Models
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Linear modeling is a popular starting point for machine learning
studies for a number of reasons. Generally, these models are relatively easy to
interpret and explain, and they can be applied to a broad range of problems. In this
course, you will learn how to choose, apply, and evaluate commonly used linear
modeling techniques. As you work through the course, you can put your new skills to
practice in 5 hands-on labs.
Prerequisites & Requirements
● Prerequisites
○ Intermediate experience with machine learning (experience using
machine learning and data science libraries like scikit-learn and
Pandas, knowledge of linear models).
○ Intermediate experience using the Databricks Workspace to perform
data analysis (using Spark DataFrames, Databricks notebooks, etc.).
○ Beginning experience with statistical concepts commonly used in data
science.
Learning objectives
● Describe and evaluate linear regression for regression problems.
● Describe how to ensure machine learning models generalize to out-of-sample
data.
● Describe and evaluate logistic regression for classification problems.
● Practice linear modeling techniques using the Databricks Data Science
Workspace.
Introduction to Applied Statistics
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course you’ll learn, both in theory and in practice, about
statistical techniques that are fundamental to many data science projects.
Throughout the course, videos will guide you through the conceptual information
you need to know about these statistical concepts, and hands-on lab activities will
give you the chance to apply the concepts you learn using the Databricks
Workspace. This course is divided into three modules: Introduction to Statistics and
Probability, Probability Distributions, and Applying Statistics to Learn from Data.
Prerequisites & Requirements
● Prerequisites
○ Beginning experience using the Databricks Data Science Workspace
(familiarity with Spark SQL, experience importing files into the
Databricks Data Science Workspace)
○ Beginning experience using Python (ability to follow guided use of the
SciPy library)
Learning objectives
● Contrast descriptive statistics and inferential statistics.
● Explain fundamental concepts behind discrete probability.
● Compare and contrast discrete and continuous probability distributions.
● Explain how discrete and continuous probability distributions can be used to
model data.
● Apply hypothesis testing techniques to learn from data.
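As a small illustration of the hypothesis-testing topic, here is a sketch using SciPy on synthetic data; it is not taken from the course labs.

# A minimal sketch: two-sample t-test on synthetic data with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=10.0, scale=2.0, size=100)   # sample from one distribution
group_b = rng.normal(loc=10.8, scale=2.0, size=100)   # sample with a slightly shifted mean

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # test whether the means differ
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")         # a small p-value suggests a real difference in means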
Introduction to Applied Tree-Based
Models
Type: Self-paced; Cost: Free for customers
Duration: 3 hours
Course description: In this course, you’ll learn how to solve complex supervised
learning problems using tree-based models. First, we’ll explain how decision trees
can be used to identify complex relationships in data. Then, we’ll show you how to
develop a random forest model to build upon decision trees and improve model
generalization. Finally, we’ll introduce you to various techniques that you can use to
account for class imbalances in a dataset. Throughout the course, you’ll have the
opportunity to practice concepts learned in hands-on labs.
Prerequisites:
● Intermediate level knowledge about machine learning/machine learning
workflows (feature engineering and selection, applying tree-based models,
etc.)
● We recommend that you take the following courses prior to taking this
course: Fundamentals of Machine Learning, Introduction to Feature
Engineering and Selection with Databricks, Applied Unsupervised Learning
with Databricks.
Learning objectives:
● Describe how decision trees are used to solve supervised learning problems.
● Identify complex relationships in data using decision trees.
● Develop a random forest model to build upon decision trees and improve
model generalization.
● Employ common techniques to account for class imbalances in a dataset.
Introduction to Applied Unsupervised
Learning
Type: Self-paced; Cost: Free for customers
Duration: 3 hours
Course description: In this course, we will describe and demonstrate how to learn
from data using unsupervised learning techniques during exploratory data analysis.
The course is divided into two sections: one focuses on K-means
clustering and the other describes principal components analysis, commonly
referred to as PCA. Each section includes demonstrations of important concepts, a
quiz to solidify your understanding, and a lab to practice your skills.
Prerequisites:
● Intermediate experience with machine learning (experience using machine
learning and data science libraries like scikit-learn and Pandas, knowledge of
linear models)
● Intermediate experience using the Databricks Workspace to perform data
analysis (using Spark DataFrames, Databricks notebooks, etc.)
● Beginning experience with machine learning concepts.
Learning objectives:
● Identify relationships between data records using K-means clustering.
● Identify patterns in a high-dimensional feature space using principal
components analysis.
● Learn from data using unsupervised learning techniques during exploratory
data analysis.
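For orientation, here is a minimal scikit-learn sketch of the two techniques the course covers; the synthetic dataset and parameters are illustrative only.

# A minimal sketch: K-means clustering and PCA on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # group similar records
X_2d = PCA(n_components=2).fit_transform(X)                              # project onto 2 principal components for exploration
print(labels[:10], X_2d.shape)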
Introduction to AutoML
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course serves as a quick introduction to Databricks AutoML.
With AutoML, ML experts can accelerate their workflows by fast-forwarding through
the usual trial and error and focusing on customizations using their domain knowledge,
and citizen data scientists can quickly achieve usable results with a low-code
approach.
Prerequisites:
● Familiarity with Databricks Machine Learning.
Learning objectives:
● Explain what AutoML is.
● Describe how AutoML fits into the Databricks ecosystem.
● Articulate how to use AutoML for appropriate use cases.
● Demonstrate how to use AutoML for given use cases.
Introduction to Cloning with Delta Lake
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: The addition of clone to Delta Lake empowers data engineers
and administrators to easily replicate data stored in the Lakehouse. Organizations
can use deep clone to archive versions of their production tables for regulatory
compliance. Developers can easily create development datasets isolated from
production data with shallow clone. In this course, you’ll learn the basics of cloning
with Delta Lake and get hands-on experience working with the syntax.
Prerequisites:
● Hands-on experience working with Delta Lake
● Intermediate experience with Spark and Databricks
Learning objectives:
● Describe the basic execution of deep and shallow clones with Delta Lake.
● Use deep clones to create full incremental backups of tables.
● Use shallow clones to create development datasets.
● Describe strengths, limitations, and caveats of each type of clone.
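As a taste of the syntax covered, here is a minimal sketch run from a Databricks notebook; the table names are hypothetical.

# A minimal sketch: deep and shallow clones of a Delta table (names are hypothetical).
spark.sql("CREATE TABLE IF NOT EXISTS prod_events_archive DEEP CLONE prod_events")   # full, independent copy of the data files
spark.sql("CREATE TABLE IF NOT EXISTS dev_events SHALLOW CLONE prod_events")         # metadata-only copy that references the source files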
Introduction to Databricks Connect
Type: Self-paced; Cost: Free for customers
Duration: 40 minutes
Course description: In this course, participants will be introduced to DB Connect
through various presentations and demos. Participants will start by contrasting how
DB Connect works with other development patterns. Then we will explore how simply
DB Connect can be installed and configured. Finally, we will
conclude with a real-time demonstration of an application running on a developer’s
local machine while executing its Spark jobs against a cluster in the Databricks
workspace.
Prerequisites:
● Intermediate experience using the Databricks Workspace
Learning objectives:
● Explain how Databricks Connect is used by data practitioners working with
Databricks.
● Install and configure Databricks Connect.
Introduction to Databricks Machine
Learning
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: Databricks Machine Learning offers data scientists and other
machine learning practitioners a platform for completing and managing the
end-to-end machine learning lifecycle. This course guides practitioners through a
basic machine learning workflow using Databricks Machine Learning. Along the way,
students will learn how each of Databricks Machine Learning’s features better enable
data scientists and machine learning engineers to complete their work effectively
and efficiently.
Prerequisites:
● Beginning-level knowledge of the Databricks Lakehouse platform
● Intermediate-level knowledge of Python
● Intermediate-level knowledge of machine learning workflows
Learning objectives:
● Give a basic overview of Databricks Machine Learning.
● Create a feature table for downstream modeling using Feature Store.
● Automatically develop a baseline model using AutoML.
● Manage the model lifecycle using Model Registry.
● Perform batch inference using the registered model and feature table.
● Schedule a monthly model refresh using Databricks Jobs and AutoML.
Introduction to Databricks Repos
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Repos aims to make Databricks simple to use by giving data
scientists and engineers the familiar tools of git repositories and file systems. These
tools enable a more laptop-like developer experience for customers. Repos is the
new, top-level, customer-facing feature that packages these tools together in the
Databricks user interface. This course teaches how to get started with Repos.
Prerequisites:
● Familiarity with Git and Git commands
● Familiarity with Databricks workspaces
Learning objectives:
● Describe the motivations for Databricks Repos.
● Configure workspace integration with Github.
● Sync local and remote notebook changes using Repos.
Introduction to Delta Lake
Type: Self-paced; Cost: Free for customers
Duration: 1 hour, 15 minutes
Course description: Delta Lake is a powerful tool created by Databricks. Delta Lake
is an open, reliable, performant and secure data storage and management layer for
your data lake that enables you to create a true single source of truth. Since it is
built upon Apache Spark, you’re able to build high performance data pipelines to
clean your data from raw ingestion to business level aggregates. Finally, given the
open format - it allows you to avoid unnecessary replication and proprietary lock-in.
Ultimately - it provides the reliability, performance, and security you need to serve
your downstream data use cases.
Prerequisites:
● Intermediate SQL skills (e.g. can do CRUD statements in SQL)
● Beginner experience with working on Databricks in the Data Science &
Engineering workspace or the Machine Learning workspace (e.g. can import
DBC files, can access workspaces). Also note: although this course relies
heavily on SQL as a language, this is not intended for learners who primarily
use the Databricks SQL workspace products.
● Beginner experience with working with data pipelines is helpful
Learning objectives:
● Describe the basic features and technical implementation of Delta Lake.
● Ingest data and manage Delta tables to keep data complete, up-to-date, and
organized.
● Optimize Delta performance using common strategies.
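To illustrate the kinds of operations covered, here is a minimal sketch, assuming a Databricks notebook where spark is predefined; the table name is hypothetical.

# A minimal sketch: create a Delta table, update it transactionally, and time travel.
spark.range(100).withColumnRenamed("id", "user_id") \
     .write.format("delta").mode("overwrite").saveAsTable("bronze_users")

spark.sql("UPDATE bronze_users SET user_id = user_id + 1000 WHERE user_id < 10")   # ACID update

spark.sql("DESCRIBE HISTORY bronze_users").show(truncate=False)                    # inspect the transaction log
spark.sql("SELECT * FROM bronze_users VERSION AS OF 0").show()                     # time travel to the first version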
Introduction to Delta Live Tables
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Delta Live Tables enables data teams to innovate rapidly with
simple development, using declarative tools to build and manage batch or streaming
data pipelines. Built-in quality controls and data quality monitoring ensure accurate
and useful BI, Data Science, and ML built on top of quality data. Delta Live Tables is
designed to scale with rapidly growing companies and provides clear observability
into pipeline operations and automatic error handling. This course will cover the
basics of this new product, including syntax, configuration, and deployment.
Prerequisites:
● Beginner experience working with PySpark or Spark SQL
● Basic familiarity with the Databricks workspace
Learning objectives:
● Describe the motivations for Delta Live Tables.
● Use PySpark or SQL syntax to declare Delta Live Tables.
● Schedule and deploy pipelines with the Databricks UI.
● Review pipeline logs and metrics.
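For a flavor of the declarative syntax, here is a minimal Python sketch of a two-table pipeline; it only runs inside a Delta Live Tables pipeline on Databricks, and the source path and expectation are illustrative.

# A minimal sketch of a Delta Live Tables pipeline (Python syntax).
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events ingested from cloud storage")
def bronze_events():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/tmp/landing/events"))          # illustrative path

@dlt.table(comment="Cleaned events")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # built-in data quality expectation
def silver_events():
    return dlt.read_stream("bronze_events").withColumn("ingested_at", F.current_timestamp())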
Introduction to Feature Engineering and
Selection with Databricks
Type: Self-paced; Cost: Free for customers
Duration: 2.5 hours
Course description: As data practitioners work on supervised machine learning
solutions, they often need to manipulate data to ensure that it is compatible with
machine learning algorithm requirements and the model is meeting its objective.
This process is known as feature engineering, and the end goal is to improve the
output of machine learning solutions. Once features are engineered, data
practitioners also commonly need to determine the best features to use in their
machine learning projects.
In this course, you’ll learn how to perform both of these tasks. This course is divided
into two modules - in the first, you’ll explore feature engineering. In the second, you’ll
explore feature selection. Both modules will start with an introduction to these
topics - what they are and why they’re used. Then, you’ll review techniques that help
data practitioners perform these tasks. Finally, you’ll have the chance to perform two
hands-on lab activities - one where you will engineer features and another where
you will select features for a fictional machine learning scenario.
Prerequisites:
● Intermediate experience with machine learning (experience using machine
learning and data science libraries like scikit-learn and Pandas, knowledge of
linear models)
● Intermediate experience using the Databricks Workspace to perform data
analysis (using Spark DataFrames, Databricks notebooks, etc.)
● Beginning experience with statistical concepts commonly used in data
science
Learning objectives:
● Explain popular feature engineering techniques used to improve supervised
machine learning solutions.
● Explain popular feature selection techniques used to improve supervised
machine learning solutions.
● Engineer meaningful features for use in a supervised machine learning project
using the Databricks Data Science Workspace.
● Select meaningful features for use in a supervised machine learning project
using the Databricks Data Science Workspace.
Introduction to Files in Databricks Repos
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course teaches how to add non-notebook files to
Databricks Repos. Learners will connect a Databricks workspace to a hosted Git
repository. Next, they will import and store non-DBC and non-notebook files using
Databricks Repos. Then, they will import a markdown file and sync changes between
a Databricks Repo and a Git provider.
Prerequisites:
● Familiarity with Git and Git commands
● Familiarity with Databricks workspaces
● Completion of the Introduction to Databricks Repos course
Learning objectives:
● Connect a Databricks workspace to a hosted Git repository using Databricks
Repos
● Import and store non-DBC and non-notebook files using already-configured
Databricks Repos with a Git provider
● Import a markdown file into the workspace
● Sync changes within Databricks to a Git provider
Introduction to Hyperparameter
Optimization
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: In this course, you’ll learn how to apply hyperparameter tuning
strategies to optimize machine learning models for unseen data. First, you’ll work
within a balanced binary classification problem setting where you’ll use random
forest to predict the correct class. You’ll learn to tune the hyperparameters of a
random forest to improve a model. Then, you’ll again work with a binary classification
problem using random forest and a technique known as cross-validation to
generalize a model.
Prerequisites:
● Intermediate level knowledge about machine learning/machine learning
workflows (feature engineering and selection, applying tree-based models,
etc.)
● We recommend that you take the following courses prior to taking this
course: Fundamentals of Machine Learning, Introduction to Feature
Engineering and Selection with Databricks, Introduction to Applied
Tree-based Models with Databricks.
Learning objectives:
● Explain common machine learning techniques that are used to optimize
machine learning models for unseen data.
● Apply machine learning techniques to improve the fit of machine learning
models.
● Apply machine learning techniques to improve the generalization of machine
learning models.
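As an illustration of the general idea (not the course’s own notebooks), here is a minimal scikit-learn sketch that tunes a random forest with cross-validation; the dataset and parameter grid are synthetic.

# A minimal sketch: tune a random forest with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},  # hyperparameters to search
    cv=5,                      # 5-fold cross-validation to estimate generalization
    scoring="f1")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)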
Introduction to Jobs
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Databricks Jobs allow users to run applications in a
non-interactive way on a cluster. Jobs allow users to manage and orchestrate
production tasks, making it simple to promote notebooks from interactive
development to scheduled workloads. In this course, you’ll explore various features
of the Jobs UI as you orchestrate a simple pipeline.
Prerequisites:
● Intermediate knowledge of Python or SQL
● Beginning knowledge of software development principles (e.g. code
modularity, code scheduling, code orchestration)
● Beginning knowledge of navigating Databricks UI
Learning objectives:
● Describe jobs and motivations for using jobs in the workflow of data
practitioners.
● Create single task jobs with a scheduled trigger.
● Orchestrate multiple notebook tasks with the Jobs UI.
● Discuss common use cases and patterns for Jobs.
Introduction to MLflow Model Registry
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course will introduce you to MLflow Model Registry. Model
Registry is a centralized model management tool that allows you to track metrics,
parameters, and artifacts as part of experiments, package models and reproducible
ML projects, and deploy models to batch or real-time serving platforms. You will
learn how your team can use Model Registry as a central place to share ML models,
collaborate on moving them from experimentation to testing and production, and
implement approval and governance workflows.
Prerequisites:
● Beginner-level experience with machine learning.
● Beginner-level experience with MLflow Model Tracking.
● Beginner-level experience with Python.
● Beginner-level experience with Apache Spark on Databricks.
Learning objectives:
● Describe the components and functionalities of Model Registry.
● Explain the benefits of using Model Registry for machine learning model
management.
● Describe how Model Registry fits into the ML lifecycle with Databricks
Machine Learning.
● Demonstrate how to use Model Registry to perform essential tasks in the ML
workflow.
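To sketch the workflow in code, the following assumes MLflow tracking and a model registry are available (as they are on Databricks); the registered model name and the scikit-learn model are illustrative.

# A minimal sketch: log a model, register it, and promote it to Staging.
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.sklearn.log_model(model, artifact_path="model")   # track the fitted model as a run artifact

# Register the logged model and move the new version to the Staging stage.
result = mlflow.register_model(f"runs:/{run.info.run_id}/model", "demo_classifier")
MlflowClient().transition_model_version_stage(
    name="demo_classifier", version=result.version, stage="Staging")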
Introduction to Multi-Task Jobs
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: After a recap of single-task jobs, as well as the directed acyclic
graph (DAG) model, you will learn how to create, trigger, or schedule Multi-Task Jobs
in Databricks.
Prerequisites:
● Experience working with the Databricks Workspace
Learning objectives:
● Explain what Multi-Task Jobs are.
● Describe how Multi-Task Jobs fits into the Databricks ecosystem.
● Articulate how to use Multi-Task Jobs for appropriate use cases.
● Demonstrate how to use Multi-Task Jobs.
Introduction to Natural Language
Processing
Type: Self-paced; Cost: Free for customers
Duration: 4 hours
Course description: This course will introduce you to natural language processing
with Databricks. You will learn how to generate
term-frequency-inverse-document-frequency (TFIDF) vectors for your datasets
and how to perform latent semantic analysis using the Databricks Machine Learning
Runtime.
Prerequisites:
● Intermediate experience performing machine learning/data science workflows
● Intermediate experience using the Databricks Data Science Workspace to
perform machine learning workflows
Learning objectives:
● Describe foundational concepts about how latent semantic analysis is used
to analyze text data.
● Perform latent semantic analysis using the Databricks Machine Learning
Runtime with the Databricks Workspace.
● Generate TFIDF vectors to reduce the noise in a dataset being used for latent
semantic analysis in a Databricks Workspace.
Introduction to Photon
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: In this course, you’ll learn how Photon can be used to reduce
Databricks total cost of ownership (TCO) and dramatically improve query
performance. You’ll also learn best practices for when to use and not use Photon.
Finally, the course will include a demonstration of a query run with and without
Photon to show the improvement in query performance.
Prerequisites:
● Administrator privileges
● Introductory knowledge about the Databricks Lakehouse Platform (what the
Databricks Lakehouse Platform is, what it does, main components, etc.)
Learning objectives:
● Explain fundamental concepts about Photon on Databricks.
● Describe the benefits of enabling Photon on Databricks.
● Identify queries that would benefit from using Photon.
● Describe the performance differences between a query run with and without
Photon enabled.
Introduction to SQL on Databricks
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Databricks, a managed platform for running Apache Spark,
provides a premier environment for processing SQL workloads. Spark SQL is a Spark
module for structured data processing. It can act as a distributed SQL query engine,
enabling queries to run up to 100x faster on existing deployments and data. Users
with a classical SQL background can immediately begin to work in the Databricks
SQL environment. Using Spark SQL on Databricks has multiple advantages over
using SQL with traditional tools.
Prerequisites & Requirements
● Prerequisites
○ Familiarity with SQL
Learning objectives
● Identify the benefits of using Spark SQL on Databricks
● Describe basic cluster computing concepts like parallelization
● Use Spark SQL on Databricks to run basic queries
● Explain how common functions and Databricks tools can be applied to
upload, view, and visualize data
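For a quick taste, here is a minimal sketch of running Spark SQL from a Databricks notebook (spark is predefined; the view and column names are made up).

# A minimal sketch: register a temporary view and query it with Spark SQL.
spark.range(1000).selectExpr("id", "id % 7 AS weekday").createOrReplaceTempView("visits")

top_days = spark.sql("""
    SELECT weekday, COUNT(*) AS visit_count
    FROM visits
    GROUP BY weekday
    ORDER BY visit_count DESC
""")
top_days.show()   # the query is distributed across the cluster's executors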
Just Enough Python for Apache Spark
Type: Premium self-paced (Free for customers); Instructor-led course: $1000 USD
Duration: 1 day (6 hours)
Course description: This 1-day course aims to help participants with or without a
programming background develop just enough experience with Python to begin
using the Apache Spark programming APIs.
Prerequisites:
● Some experience in a structured programming language such as Javascript,
C++, or R is helpful
Learning objectives:
● Navigate the Python documentation
● Employ basic programming constructs such as conditional statements and
loops
● Use functions and classes from existing libraries
● Create new functions and classes
● Identify and use the primary collection types
● Understand the breadth of the language's string functions (and other misc
utility functions)
● Employ basic exception handling
● Describe and possibly employ some of the key features of functional
programming
Course agenda:
● Part 1: Getting Started with Python
○ Key Python internet resources
○ How to run Python code in various environments
● Part 2: Variable and Data Types
○ Fundamental Python concepts
○ Introduction to 4 basic data types
○ Declare and assign variables
○ Employ simple, built-in functions
○ Develop and use assert statements
● Part 3: Conditionals and Loops
○ Create a simple list
○ Iterate over a list using a for expression
○ Conditionally execute statements using if, elif, and else expressions
● Part 4: Methods, Functions, and Packages
○ Develop and use functions with and without arguments and type hints
○ Use assert statements to “unit test” functions
○ Employ the help() function to learn about modules, functions, classes,
and keywords
○ Identify differences between functions and methods
○ Import libraries
● Part 5: Collections and Classes
○ Use list methods and syntax to append, remove, or replace elements of
a list
○ Compare ranges to lists
○ Define dictionaries
○ Use list and dictionary comprehensions to efficiently transform each
element of each data structure
○ Define classes and methods
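As a small taste of the constructs listed in the agenda, here is an illustrative snippet (not taken from the course materials).

# Variables, a function with type hints, assert-based checks, a loop, and a list comprehension.
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a Fahrenheit temperature to Celsius."""
    return (temp_f - 32) * 5 / 9

assert round(fahrenheit_to_celsius(212), 1) == 100.0   # a simple "unit test"

readings_f = [32, 68, 98.6, 212]
readings_c = [round(fahrenheit_to_celsius(t), 1) for t in readings_f]   # list comprehension

for f, c in zip(readings_f, readings_c):
    print(f"{f}F -> {c}C")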
Just Enough Scala for Apache Spark
Type: Instructor-led course: $1000 USD
Duration: 1 day (6 hours)
Course description: This 1-day course aims to help participants with or without a
programming background develop just enough experience with Scala to begin using
the Apache Spark programming APIs.
Prerequisites:
● Some experience in a structured programming language such as Javascript,
C++, or R is helpful
Learning objectives:
● Navigate the Scala documentation
● Employ basic programming constructs such as conditional statements and
loops
● Use functions and classes from existing libraries
● Create new functions and classes
● Identify and use the primary collection types
● Understand the breadth of the language's string functions (and other misc
utility functions)
● Employ basic exception handling
Course agenda:
● Coming soon.
Lakehouse with Delta Lake Deep Dive
Type: Self-paced; Cost: Free for customers
Duration: 3 hours
Course description: This course begins with an overview of the Lakehouse
architecture, and an in-depth look at key Delta Lake features and functionality that
make a Lakehouse possible. Participants will build end-to-end OLAP data pipelines
using Delta Lake for batch and streaming data. The course also discusses serving
data to end users through aggregate tables and Databricks SQL Analytics.
Throughout the course, emphasis will be placed on using data engineering best
practices with Databricks.
Prerequisites:
● Intermediate to advanced SQL skills
● Intermediate to advanced Python skills
● Beginning experience using the Spark DataFrames API
● Beginning knowledge of general data engineering concepts
● Beginning knowledge of the core features and use cases of Delta Lake
Learning objectives:
● Identify the core components of Delta Lake that make a Lakehouse possible.
● Define commonly used optimizations available in Delta Engine.
● Build end-to-end batch and streaming OLAP data pipelines using Delta Lake.
● Make data available for consumption by downstream stakeholders using
specified design patterns.
● Document data at the table level to promote data discovery and cross-team
communication.
● Apply Databricks’ recommended best practices in engineering a single source
of truth Delta architecture.
Machine Learning in Production: MLflow
and Model Deployment
Type: Instructor-led course; $1000 USD
Duration: 1 day (6 hours)
Course description: This course is separated into two main components. The first
uses MLflow as the backbone for machine learning development and production.
This includes tracking the machine learning lifecycle, packaging projects for
deployment, using the MLflow model registry, and more. The second component
looks at various production issues, the four main deployment paradigms, monitoring,
and alerting. Depending on the desires of the class, numerous electives are also
available on the various MLflow functionality and deployment scenarios.
By the end of this course, you will have built the infrastructure to manage the
development, deployment, and monitoring of models in production. This course is
taught entirely in Python.
Prerequisites:
● Intermediate experience using Python/pandas
● Familiarity with Apache Spark
● Working knowledge of machine learning and data science (scikit-learn,
TensorFlow, etc.)
● Basic familiarity with object storage, databases, and networking
Learning objectives:
● Track machine learning experiments to organize the machine learning life
cycle
● Create, organize, and package machine learning projects with a focus on
reproducibility and collaborating with a team
● Develop a generalizable way of handling machine learning models created in
and deployed to a variety of environments
● Explore the various production issues encountered in deploying and
monitoring machine learning models
● Introduce various strategies for deploying models using batch, streaming, and
real-time
● Explore solutions to drift and implement a basic retraining method and two
ways of dynamically monitoring drift
Course agenda:
● What is ML deployment?
● The Four Deployment Paradigms
● Deployment Requirements
● Deployment Architectures
● Other Issues
Migrating SAS Procedures to Databricks
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course will enable experienced SAS developers to quickly
learn how to translate familiar SAS statements and functions into code that can be
run on Databricks. It begins with an introduction to the Databricks environment and
the different approaches to coding in Databricks, followed by an overview of how
SAS PROC and DATA steps can be performed in Databricks. You will learn how
you can use Spark SQL, PySpark, and other tools to read .sas7bdat files and perform
common operations. Finally, you will see code examples and gain hands-on practice
performing some of the most common SAS operations in Databricks.
Prerequisites:
● Intermediate to advanced SAS programming experience
● Beginning knowledge of Python programming
● Beginning-level experience with SQL
Learning objectives:
● Read data stored in .sas7bdat files using Spark SQL and PySpark.
● Explain the conceptual and syntactical relationships between SAS DATA and
PROC statements and their equivalents on Databricks.
● Learn how Python can be leveraged to augment ANSI SQL to create reusable
Spark SQL code.
● Translate common PROC functions to Databricks.
● Translate common DATA steps to Databricks.
Natural Language Processing at Scale with
Databricks
Type: Self-paced; Cost: Free for customers
Duration: 5 hours
Course description: This five-hour course will teach you how to do natural language
processing at scale on Databricks. You will apply libraries such as NLTK and Gensim
in a distributed setting as well as SparkML/MLlib to solve classification, sentiment
analysis, and text wrangling tasks. You will learn how to remove stop words, when to
lemmatize vs stem your tokens, and how to generate
term-frequency-inverse-document-frequency (TFIDF) vectors for your dataset. You
will also use dimensionality reduction techniques to visualize word embeddings with
Tensorboard and apply and visualize basic vector arithmetic to embeddings.
Prerequisites:
● Experience working with PySpark DataFrames
● Mastery of concepts presented in the Databricks Academy "Apache Spark
Programming" course
● Mastery of concepts presented in the Databricks Academy "Scalable Machine
Learning with Apache Spark" course
Learning objectives:
● Explain the motivation behind using Natural Language Processing to analyze
data.
● Identify distributed Natural Language Processing libraries commonly used
when analyzing data.
● Perform a series of Natural Language Processing workflows in the Databricks
Data Science Workspace
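As one small illustration of the distributed text wrangling involved, here is a minimal Spark ML TF-IDF sketch; the documents and column names are made up, and it assumes a Spark session named spark (as in a Databricks notebook).

# A minimal sketch: TF-IDF vectors with Spark ML.
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

docs = spark.createDataFrame(
    [(0, "spark makes big data simple"), (1, "databricks runs spark in the cloud")],
    ["doc_id", "text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)                 # split text into tokens
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1024).transform(tokens)   # term frequencies
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)                    # down-weight common terms
tfidf.select("doc_id", "tfidf").show(truncate=False)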
Optimizing Apache Spark on Databricks
Type: Self-Paced (free for customers); Instructor-led ($1000 USD)
Duration: 1 day (6 hours)
Course description: In this course, students will explore five key problems that
represent the vast majority of performance problems in an Apache Spark
application: Skew, Spill, Shuffle, Storage, and Serialization. With each of these topics,
we explore examples that demonstrate how these problems are introduced, how to
diagnose these problems with tools like the Spark UI, and conclude by discussing
mitigation strategies for each of these problems.
The course will also address concepts including:
● Key ingestion concepts
● New features like Adaptive Query Execution and Dynamic Partition Pruning
● Configuring clusters for optimal performance
Prerequisites:
● Intermediate to advanced programming experience in Python or Scala
● Hands-on experience developing Apache Spark applications
Learning objectives:
● Articulate how the five most common performance problems in a Spark
application can be mitigated to achieve better application performance.
● Summarize some of the most common performance problems associated
with data ingestion and how to mitigate them.
● Articulate how new features in Spark 3.0 can be employed to mitigate
performance problems in your Spark applications.
● Configure a Spark cluster for maximum performance given specific job
requirements while considering a multitude of other factors.
Course agenda:
● Day 1 AM
○ The Five Most Common Performance Problems
■ Introduction / Benchmarking
■ Skew
■ Spill
■ Shuffle
■ Storage
■ Serialization
● Day 1 PM
○ Key Ingestion Concepts
■ Ingestion Basics
■ Predicate Push Downs
■ Disk Partitioning
■ Z-Ordering
■ Bucketing
○ Optimizing with AQE and DPP
■ Tuning Shuffle Partitions
■ Join Optimizations
■ Skew Join Optimizations
■ Dynamic Partition Pruning
○ Designing Clusters for High Performance
■ Designing Clusters for High Performance
■ Cluster Configurations Scenarios
■ Designing Clusters Breakout
● Optional review topics
○ Introduction to Spark Architecture
○ Spark UI Demo
Propagating Changes with Delta Change
Data Feed
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: A Delta change data feed represents row-level changes
between versions of a Delta table. When enabled on a Delta table, the runtime
records “change events” for all the data written into the table. This includes the row
data along with metadata indicating whether the specified row was inserted,
deleted, or updated. In this course, we'll examine some of the motivations and use
cases for this feature and see it in action.
Prerequisites:
● Basic knowledge of Spark Structured Streaming APIs
● Basic knowledge of Delta Lake
Learning objectives:
● Describe how Delta Change Data Feed emits change data records.
● Use appropriate syntax and settings to set up Change Data Feed.
● Propagate inserts, updates, and deletes with Change Data Feed.
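For a flavor of the syntax and settings covered, here is a minimal sketch on Databricks; the table name and starting version are illustrative.

# A minimal sketch: enable and read a Delta change data feed.
# 1. Enable the change data feed on an existing Delta table.
spark.sql("ALTER TABLE silver_users SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# 2. Read the row-level change events recorded after the property was set.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 1)        # or startingTimestamp
           .table("silver_users"))
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()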
Quick Reference: CI/CD
Type: Self-paced, Cost: Free for customers
Duration: 30 minutes
Course description: This quick-reference provides an overview of fundamental
concepts behind CI/CD. While the Databricks tools and integrations mentioned in
this course can be used by DevOps teams for CI/CD, this course was designed to
summarize what happens during each stage of a CI/CD pipeline (not provide a
technical how-to into each of these stages). Future courses will dive into each of
these stages in greater detail. Note: We will use Jenkins as an example automation
system in this course.
Prerequisites:
● Beginning-level experience with CI/CD, DevOps, and/or the software
development lifecycle (not necessarily on Databricks)
Learning objectives:
● Summarize each stage in a traditional CI/CD pipeline.
● Outline the steps in configuring the Jenkins automation agent for use in
CI/CD.
Quick Reference: Databricks Workspace
User Interface
Type: Self-paced; Cost: Free for customers
Duration: 10 minutes
Course description: This is a short, ten-minute introduction to the Databricks
Collaborative Data Science Workspace (Workspace). If you are new to Databricks,
we recommend taking this course to familiarize yourself with the layout of the
Databricks Workspace.
Prerequisites:
● Beginning-level knowledge of big data and data science concepts.
● Beginning-level knowledge of the functionality within the Unified Data
Analytics Platform.
Learning objectives:
● Define vocabulary terms relevant to data practitioners working in the
Workspace.
● Navigate to major components of the Workspace.
Quick Reference: Managing Databricks
Notebooks with the Databricks
Workspace
Type: Self-paced; Cost: Free for customers
Duration: 5 minutes
Course description: This course includes a series of short videos that show how to
perform tasks to manage Databricks notebooks including creating, opening, deleting,
and distributing notebooks. It also includes information on attaching and detaching
notebooks to clusters and controlling access to notebooks. This course does not
cover how to analyze data using notebooks.
Prerequisites:
● Beginning experience with the Databricks Unified Data Analytics Platform
helpful but not required.
● Beginning knowledge about data science and big data concepts helpful but
not required.
Learning objectives:
● Execute basic Databricks notebooks management tasks in the Collaborative
Data Science Workspace.
Quick Reference: Relational Entities on
Databricks
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: While the syntax for creating and working with databases,
tables, and views will be familiar to most SQL users, some default behaviors may
surprise users new to Databricks. In this short course, you’ll learn how to create
databases, tables, and views on Databricks. Special attention will be given to
differences in scope and persistence for these various entities, allowing any user
that will be responsible for creating or managing databases, tables, or views to make
informed decisions for their organization.
Prerequisites:
● Beginning knowledge of SQL
● Beginning knowledge of loading and interacting with data files in Spark
Learning objectives:
● Write basic queries that create databases, tables, and views.
● Describe how relational entities are managed by the catalog on Databricks.
● Describe how the LOCATION keyword changes the default behavior for
database contents.
● Describe the differences in syntax and performance for managed and
unmanaged tables.
● Describe the differences in scope and persistence between views, temp
views, and global temp views.
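To preview the distinctions the course draws, here is a minimal sketch run from a Databricks notebook; all database, table, and path names are hypothetical.

# A minimal sketch: managed vs. unmanaged tables, and view scopes.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db LOCATION '/tmp/demo_db'")   # LOCATION overrides the default storage path
spark.sql("USE demo_db")

spark.sql("CREATE TABLE managed_sales (id INT, amount DOUBLE)")              # managed: data lives with the database
spark.sql("""CREATE TABLE unmanaged_sales (id INT, amount DOUBLE)
             USING DELTA LOCATION '/tmp/external/sales'""")                  # unmanaged: data stays at the external path

spark.sql("CREATE OR REPLACE TEMP VIEW recent_sales AS SELECT * FROM managed_sales WHERE id > 100")   # session-scoped
spark.sql("CREATE OR REPLACE GLOBAL TEMP VIEW all_sales AS SELECT * FROM managed_sales")              # cluster-scoped, under global_temp
spark.sql("SELECT * FROM global_temp.all_sales").show()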
Quick Reference: Spark Architecture
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Apache Spark™ is a unified analytics engine for large scale data
processing known for its speed, ease and breadth of use, ability to access diverse
data sources, and APIs built to support a wide range of use-cases. Databricks builds
on top of Spark and adds many performance and security enhancements. This
course is meant to provide an overview of Spark’s internal architecture.
Prerequisites:
● Beginning knowledge of big data and data science concepts.
Learning objectives:
● Describe basic Spark architecture and define terminology such as “driver”
and “executor”
● Explain how parallelization allows Spark to improve speed and scalability of an
application
● Describe lazy evaluation and how it relates to pipelining
● Identify high-level events for each stage in the Optimization process
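As a one-minute illustration of lazy evaluation and pipelining, consider this sketch (spark is the usual SparkSession provided in a Databricks notebook).

# A minimal sketch: transformations only build a plan; the action triggers the work.
df = spark.range(10_000_000)                                       # no job runs yet
evens = df.filter("id % 2 = 0").selectExpr("id * 2 AS doubled")    # still lazy: transformations are pipelined
print(evens.count())                                               # the action launches jobs, stages, and tasks on the executors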
Scalable Deep Learning with TensorFlow
and Apache Spark™
Type: Instructor-led and self-paced; Cost: Instructor-led $1500 USD; self-paced -
free for customers
Duration: 2 days (12 hours)
Course description: This course starts with the basics of the tf.keras API including
defining model architectures, optimizers, and saving/loading models. You then
implement more advanced concepts such as callbacks, regularization, TensorBoard,
and activation functions. After training your models, you will integrate the MLflow
tracking API to reproduce and version your experiments. You will also apply model
interpretability libraries such as LIME and SHAP to understand how the network
generates predictions. You will also learn about various Convolutional Neural
Networks (CNNs) architectures and use them as a basis for transfer learning to
reduce model training time.
Substantial class time is spent on scaling your deep learning applications, from
distributed inference with pandas UDFs to distributed hyperparameter search with
Hyperopt to distributed model training with Horovod. This course is taught fully in
Python.
Prerequisites:
● Intermediate experience with Python/pandas
● Familiarity with machine learning concepts
● Experience with PySpark
Learning objectives:
● Build deep learning models using Keras/TensorFlow
● Scale the following:
○ Model inference with pandas UDFs & pandas function API
○ Hyperparameter tuning with HyperOpt
○ Training of distributed TensorFlow models with Horovod

● Track, version, and reproduce experiments using MLflow
● Apply model interpretability libraries to understand & visualize model
predictions
● Use CNNs (convolutional neural networks) and perform transfer learning &
data augmentation to improve model performance
● Deploy deep learning models
Course agenda (instructor-led):
● Day 1 AM
○ Spark Review
○ Linear Regression
○ Keras
○ Keras lab
● Day 1 PM
○ Advanced Keras
○ Advanced Keras lab
○ MLflow
○ MLflow lab
○ Hyperopt
○ Hyperopt lab
○ Horovod
● Day 2 AM
○ Horovod Petastorm
○ Horovod lab
○ Model interpretability
○ CNNs
● Day 2 PM
○ CNNs
○ SHAP for CNNs
○ Transfer learning
○ Data augmentation
○ Transfer learning lab
○ Model serving
○ Generative adversarial networks
○ Best practices
Scalable Machine Learning with Apache
Spark
Type: Instructor-led and self-paced; Cost: Instructor-led $1500 USD; self-paced -
free for customers
Duration: 2 days (12 hours)
Course description: This course guides students through the process of building
machine learning solutions using Spark. You will build and tune ML models with
SparkML using transformers, estimators, and pipelines. This course highlights some
of the key differences between SparkML and single-node libraries such as
scikit-learn. Furthermore, you will reproduce your experiments and version your
models using MLflow. You will also integrate third-party libraries into Spark workloads,
such as XGBoost. In addition, you will leverage Spark to scale inference of
single-node models and parallelize hyperparameter tuning.
Prerequisites:
● Intermediate experience with Python
● Beginning experience with the PySpark DataFrame API (or having taken the
Apache Spark Programming with Databricks class)
● Working knowledge of machine learning and data science
Learning objectives:
● Create data processing pipelines with Spark.
● Build and tune machine learning models with Spark ML.
● Track, version, and deploy models with MLflow.
● Perform distributed hyperparameter tuning with Hyperopt.
● Use Spark to scale the inference of single-node models.
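As a preview of the transformer/estimator pattern the course builds on, here is a minimal Spark ML pipeline sketch with made-up columns and data.

# A minimal sketch: assemble features and fit a linear regression in a Pipeline.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

data = spark.createDataFrame(
    [(1.0, 2.0, 10.0), (2.0, 1.0, 12.0), (3.0, 5.0, 21.0), (4.0, 3.0, 24.0)],
    ["x1", "x2", "label"])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),   # transformer
    LinearRegression(featuresCol="features", labelCol="label")       # estimator
])
model = pipeline.fit(data)
model.transform(data).select("features", "label", "prediction").show()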
Course agenda (instructor-led):
● Day 1 AM
○ Apache Spark Overview
○ ML Overview
○ Data Cleansing
○ Data Exploration
○ Linear Regression I
● Day 1 PM
○ Linear Regression I Continued
○ Linear Regression II
○ MLflow Tracking
○ MLflow Model Registry
○ MLflow Lab
● Day 2 AM
○ Decision Trees
○ Hyperparameter Tuning
○ Hyperopt
● Day 2 PM
○ Hyperopt Continued
○ MLlib Deployment Options
○ XGBoost
○ Inference with Pandas UDFs
○ Training with Pandas UDFs
○ Koalas
○ Capstone Project
Service Overview: Databricks Data
Science & Engineering Workspace
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: The Databricks Data Science and Engineering Workspace
(Workspace) provides a collaborative analytics platform to help data practitioners
get the most out of Databricks when it comes to data science and engineering
tasks. This course (formerly known as Databricks Data Science & Engineering
Workspace) guides practitioners through fundamental Workspace concepts and
components necessary to achieve a basic development workflow.
Prerequisites & Requirements
● Prerequisites
○ None.
Learning objectives
● Add a user to your Databricks environment and provide access to Databricks
SQL.
● Create and start a SQL endpoint to provide a computation resource for the
user.
● Configure access to the default database for the user to run SQL commands
on using the SQL endpoint.
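For context only, a minimal sketch of what granting a user access to the default database might look like; the principal is a placeholder, the statements assume legacy table access control, and the exact steps used in the course may differ.

    # Placeholder principal; run from a notebook or as equivalent statements
    # in the Databricks SQL editor.
    user = "new.analyst@example.com"
    spark.sql(f"GRANT USAGE ON DATABASE default TO `{user}`")
    spark.sql(f"GRANT SELECT ON DATABASE default TO `{user}`")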
SQL Coding Challenges
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Taking this course will familiarize you with the content and
format of the Associate SQL Analyst Accreditation, as well as provide you with some
practical exercises that you can use to improve your skills or cement newly learned
concepts. We recommend that you complete Fundamentals of SQL on Databricks
and Applications of SQL on Databricks before using this guide.
Prerequisites:
● Intermediate-level ability with SQL
Learning objectives:
● Describe the format and scope of the SQL analyst accreditation
● Identify the scope of knowledge-based and practical topics covered
● Complete practical exercises to practice applying SQL skills on Databricks
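To give a sense of the practical exercises (this example is illustrative and not taken from the accreditation), a typical task resembles a grouped aggregation run from a Databricks notebook; the sales table and its columns are hypothetical.

    # Hypothetical table and columns, for illustration only.
    top_regions = spark.sql("""
        SELECT region,
               ROUND(SUM(amount), 2) AS total_sales
        FROM sales
        GROUP BY region
        ORDER BY total_sales DESC
        LIMIT 5
    """)
    top_regions.show()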
Structured Streaming
Type: Self-paced; Cost: Free for customers
Duration:
Course description: This hands-on self-paced training course targets data
engineers who want to process big data using Apache Spark™ Structured Streaming.
The course is a series of four self-paced lessons. Each lesson includes hands-on
exercises. The course contains Databricks notebooks for both Azure Databricks and
AWS Databricks; you can run the course on either platform.
Prerequisites:
● Completion of the Apache Spark Programming with Databricks course is
strongly encouraged
Learning objectives:
● Use the interactive Databricks notebook environment
● Ingest streaming log file data
● Aggregate small batches of data with time windows
● Stream data from a Kafka connection
● Use Structured Streaming in conjunction with Databricks Delta
● Visualize streaming live data
● Use Structured Streaming to analyze streaming Twitter data
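As an illustration of the windowed aggregations covered in the course (a minimal sketch, not a course notebook), the input path and schema below are assumptions.

    from pyspark.sql.functions import window, col

    # Illustrative schema and input path only.
    events = (spark.readStream
              .schema("event_time TIMESTAMP, level STRING, message STRING")
              .json("/tmp/demo/streaming_logs/"))

    # Count events per severity level in 10-minute windows.
    counts = (events
              .groupBy(window(col("event_time"), "10 minutes"), col("level"))
              .count())

    query = (counts.writeStream
             .outputMode("complete")
             .format("memory")
             .queryName("log_counts")
             .start())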
Tracking Experiments with MLflow
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: In this course, we’ll show you how to design an MLflow
experiment to identify the best machine learning model for deployment. This course is the
second in a series of three courses developed to show you how to use Databricks to
work with a single data set from experimentation to production-scale machine
learning model deployment. The other courses in this series include:
● Data Science on Databricks: The Bias-Variance Tradeoff
● Deploying a Machine Learning Project with MLflow Projects
Prerequisites:
● Beginning-level experience running data science workflows in the Databricks
Workspace
● Beginner-level experience with Apache Spark
● Intermediate-level experience with the SciPy numerical stack
Learning objectives:
● Create and explore an augmented sample from user event and profile data.
● Design an MLflow experiment and write notebook-based software to run the
experiment to assess various linear models.
● Examine experimental results to decide which model to develop for
production.
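For orientation (a minimal sketch under assumed data, not course material): the log-and-compare pattern the course builds on looks roughly like this; the Ridge model and the X_train/X_test/y_train/y_test variables are assumptions.

    import mlflow
    import mlflow.sklearn
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    # X_train, X_test, y_train, y_test are assumed to exist already.
    with mlflow.start_run(run_name="ridge-alpha-0.5"):
        model = Ridge(alpha=0.5)
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        mlflow.log_param("alpha", 0.5)       # record the hyperparameter
        mlflow.log_metric("mse", mse)        # record the evaluation metric
        mlflow.sklearn.log_model(model, "model")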
What’s New in Apache Spark 3.0
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course was created to teach Databricks users about the
major improvements to Spark in the 3.0 release. It will give an overview of new
features meant to improve performance and usability. Students will also learn about
backwards compatibility with 2.x and some of the considerations required for
updating to Spark 3.0.
Prerequisites:
● Familiarity with Spark 2.x
Learning objectives:
● Describe major improvements to performance in Spark 3.0
● Identify major usability improvements in Spark 3.0
● Recognize relevant compatibility considerations for migrating to Spark 3.0
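One of the Spark 3.0 performance features covered is Adaptive Query Execution; as a small illustration (configuration settings only, not course content), it can be enabled for a session as follows.

    # Enable Adaptive Query Execution and automatic partition coalescing (Spark 3.0+).
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")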
Credential descriptions
Azure Databricks Certified Associate
Platform Administrator
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: The Azure Databricks Certified Associate Platform Administrator
certification exam assesses an understanding of network infrastructure and security
with Databricks, including workspace deployment, Azure cloud concepts, and
network security. The exam also assesses the understanding of identity and access
on Azure Databricks, including identity management, workspace access control,
data access control, and fine-grained security. In addition, the exam assesses
cluster configuration and usage management. Lastly, developer tools and
automation processes are assessed.
Prerequisites:
● The minimally qualified candidate should:
○ have an intermediate understanding of network infrastructure and
security, including: workspace deployment, Azure cloud concepts,
network security
○ have a complete understanding of identity and access configurations,
including: identity management, workspace access control, data
access control, fine-grained security using SQL
○ have an intermediate understanding of cluster usage, including: cluster
configuration and usage management
○ have a basic understanding of automation, including: developer tools,
automation processes
Databricks Certified Associate Developer
for Apache Spark 3.0
Type: Certification; Cost: $200 USD
Duration: 2 hours
Description: The Databricks Certified Associate Developer for Apache Spark 3.0
certification exam assesses the understanding of the Spark DataFrame API and the
ability to apply the Spark DataFrame API to complete basic data manipulation tasks
within a Spark session. These tasks include selecting, renaming and manipulating
columns; filtering, dropping, sorting, and aggregating rows; handling missing data;
combining, reading, writing and partitioning DataFrames with schemas; and working
with UDFs and Spark SQL functions. In addition, the exam will assess the basics of
the Spark architecture like execution/deployment modes, the execution hierarchy,
fault tolerance, garbage collection, and broadcasting.
Prerequisites:
● The minimally qualified candidate should:
○ have a basic understanding of the Spark architecture, including
Adaptive Query Execution
○ be able to apply the Spark DataFrame API to complete individual data
manipulation tasks, including:
○ selecting, renaming and manipulating columns
○ filtering, dropping, sorting, and aggregating rows
○ joining, reading, writing and partitioning DataFrames
○ working with UDFs and Spark SQL functions
Developers who have been using the Spark DataFrame API for six months or
more are expected to be able to pass this certification exam.
While it will not be explicitly tested, the candidate must have a working
knowledge of either Python or Scala. The exam is available in both languages.
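For a sense of the exam's practical scope (an illustrative sketch, not an exam item), the following PySpark snippet touches several of the listed DataFrame tasks; df and its columns are assumptions.

    from pyspark.sql import functions as F

    # df is an assumed DataFrame with "category", "price", and "quantity" columns.
    result = (df
              .withColumnRenamed("price", "unit_price")                        # renaming columns
              .withColumn("revenue", F.col("unit_price") * F.col("quantity"))  # manipulating columns
              .na.drop(subset=["category"])                                    # handling missing data
              .filter(F.col("revenue") > 0)                                    # filtering rows
              .groupBy("category")
              .agg(F.sum("revenue").alias("total_revenue"))                    # aggregating rows
              .orderBy(F.desc("total_revenue")))                               # sorting rows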
Databricks Certified Professional Data
Engineer
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: The Databricks Certified Professional Data Engineer certification
exam assesses an individual’s ability to use Databricks to perform common data
engineering tasks. This includes an understanding of the Databricks platform and
developer tools like Apache Spark, Delta Lake, MLflow, and the Databricks CLI and
REST API. It also assesses the ability to build optimized and cleaned ETL pipelines.
Additionally, modeling data into a Lakehouse using knowledge of general data
modeling concepts will also be assessed. Finally, ensuring that data pipelines are
secure, reliable, monitored, and tested before deployment will also be included in
this exam. Individuals who pass this certification exam can be expected to complete
data engineering tasks using Databricks and its associated tools.
Prerequisites:
● The minimally qualified candidate should have:
○ an understanding of the Databricks platform and developer tools,
including: Apache Spark, Delta Lake, MLflow, the Databricks CLI, and
the Databricks REST API
○ the ability to build optimized and cleaned ETL pipelines
○ an understanding of how to model data into a Lakehouse using general
data modeling concepts
○ the ability to ensure that data pipelines are secure, reliable,
monitored, and tested before deployment
Databricks Certified Professional Data
Scientist
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: The Databricks Certified Professional Data Scientist certification
exam assesses the understanding of the basics of machine learning and the steps in
the machine learning lifecycle, including data preparation, feature engineering, the
training of models, model selection, interpreting models, and the production of
models. The exam also assesses the understanding of basic machine learning
algorithms and techniques, including linear regression, logistic regression,
regularization, decision trees, tree-based ensembles, basic clustering algorithms,
and matrix factorization techniques. The basics of model management with MLflow,
like logging and model organization, are also assessed.
Prerequisites:
● The minimally qualified candidate should have:
○ a complete understanding of the basics of machine learning, including:
the bias-variance tradeoff, in-sample vs. out-of-sample data, categories
of machine learning, applied statistics concepts
○ an intermediate understanding of the steps in the machine learning
lifecycle, including: data preparation, feature engineering, model
training, model selection, model production, and interpreting models
○ a complete understanding of basic machine learning algorithms and
techniques, including: linear, logistic, and regularized regression,
tree-based models like decision trees, random forest and gradient
boosted trees, unsupervised techniques like K-means and PCA,
specific algorithms like ALS for recommendation and isolation forests
for outlier detection
○ a complete understanding of the basics of machine learning model
management like logging and model organization with MLflow
Fundamentals of the Databricks
Lakehouse Platform Accreditation
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Accreditation description: This is a 30-minute assessment that will test your
knowledge about fundamental concepts related to the Databricks Lakehouse
Platform. Questions will assess how well you know the platform in general, how
familiar you are with the individual components of the platform, and your ability to
describe how the platform helps organizations accomplish their data engineering,
data science/machine learning, and business/SQL analytics use cases. Please note
that this assessment will not test your ability to perform tasks using Databricks
functionality. Instead, it will test how well you can explain components of the
platform and how they fit together.
After successfully completing this assessment, you will be awarded a Databricks
Lakehouse Platform badge.
This accreditation is the beginning step in most of the Databricks Academy learning
plans - SQL Analysts, Data Scientists, Data Engineers, and Platform Administrators.
Business leaders are also welcome to take this assessment.
Prerequisites:
● We recommend that you take the following courses to prepare for this
accreditation exam:
○ What is the Databricks Lakehouse Platform?
○ What are Enterprise Data Management Systems? (particularly the
section on Lakehouse architecture)
○ What is Delta Lake?
○ What is Databricks SQL?
○ What is Databricks Machine Learning?
SQL Analyst Accreditation
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Accreditation description: In this 1-hour accreditation exam, you will demonstrate
your ability to use Apache Spark SQL to query, transform, and present data.
Prerequisites:
● Intermediate experience with SQL.