Q4 2022 Course Catalog
UPDATED: JANUARY 2022
Welcome to the Databricks Academy
About the Databricks Academy
About this course catalog
What’s new this quarter
What’s being retired this quarter
Databricks Academy offerings
Training
Credentials
Learning paths
Business leaders
Platform administrators
SQL analysts
Data scientists
Data engineers
Course descriptions
Apache Spark Programming with Databricks
Applications of SQL on Databricks
Applied Data Engineering Solutions with Databricks
AWS Databricks Cloud Architecture and System Integration Fundamentals
AWS Databricks Cluster Usage Management
AWS Databricks Data Access Management
AWS Databricks Identity Access Management
AWS Databricks Security Fundamentals
AWS Databricks SQL Administration
AWS Databricks Workspace Deployment
Azure Databricks Cloud Architecture and System Integration Fundamentals
Azure Databricks Cluster Usage Management
Azure Databricks Data Access Management
Azure Databricks Identity Access Management
Azure Databricks Security Fundamentals
Azure Databricks SQL Administration
Azure Databricks Workspace Deployment
Certification Prep Course for the Databricks Certified Associate Developer for Apache Spark Exam
Configuring Workspace Access Control Lists (ACLs)
Data Engineering with Databricks
Data Science on Databricks Rapid Start
Data Science on Databricks - Bias-Variance Tradeoff
Data Visualization with Databricks SQL
Databricks Command Line Interface (CLI) Fundamentals
Databricks Datadog Integration
Databricks on Google Cloud: Architecture and Security Fundamentals
Databricks on Google Cloud: Cloud Architecture and System Integration
Databricks on Google Cloud: Cluster Usage Management
Databricks on Google Cloud: Workspace Deployment
Databricks with R
Databricks Workspace Fundamentals for Business Analytics
Delta Lake Rapid Start with Python
Delta Lake Rapid Start with Spark SQL
Deploying a Machine Learning Project with MLflow Projects
Easy ETL with Auto Loader
Enterprise Architecture with Databricks
Fundamentals of Big Data
Fundamentals of Cloud Computing
Fundamentals of Databricks Machine Learning
Fundamentals of Databricks SQL
Fundamentals of Delta Lake
Fundamentals of Enterprise Data Management Systems
Fundamentals of Lakehouse Architecture
Fundamentals of Machine Learning
Fundamentals of Structured Streaming
Fundamentals of the Databricks Lakehouse Platform
Google Cloud Fundamentals
How to Code-Along with Self-Paced Courses
How to Manage Clusters in Databricks
Introduction to Apache Spark Architecture
Introduction to Applied Linear Models
Introduction to Applied Statistics
Introduction to Applied Tree-Based Models
Introduction to Applied Unsupervised Learning
Introduction to AutoML
Introduction to Cloning with Delta Lake
Introduction to Databricks Connect
Introduction to Databricks Machine Learning
Introduction to Databricks Repos
Introduction to Delta Lake
Introduction to Delta Live Tables
Introduction to Feature Engineering and Selection with Databricks
Introduction to Hyperparameter Optimization
Introduction to Jobs
Introduction to MLflow Model Registry
Introduction to Multi-Task Jobs
Introduction to Natural Language Processing
Introduction to SQL on Databricks
Just Enough Python for Apache Spark
Just Enough Scala for Apache Spark
Lakehouse with Delta Lake Deep Dive
Machine Learning in Production: MLflow and Model Deployment
Migrating SAS Procedures to Databricks
Natural Language Processing at Scale with Databricks
Optimizing Apache Spark on Databricks
Propagating Changes with Delta Change Data Feed
Quick Reference: CI/CD
Quick Reference: Databricks Workspace User Interface
Quick Reference: Managing Databricks Notebooks with the Databricks Workspace
Quick Reference: Relational Entities on Databricks
Quick Reference: Spark Architecture
Scalable Deep Learning with TensorFlow and Apache Spark™
Scalable Machine Learning with Apache Spark
Setting up Databricks SQL
SQL Coding Challenges
Structured Streaming
Tracking Experiments with MLflow
What’s New in Apache Spark 3.0
Credential descriptions
Associate SQL Analyst Accreditation
Databricks Certified Associate Developer for Apache Spark 2.4
Databricks Certified Associate Developer for Apache Spark 3.0
Databricks Certified Associate ML Practitioner for Apache Spark 2.4
Databricks Certified Professional Data Scientist
Fundamentals of the Databricks Lakehouse Platform
Welcome to the Databricks Academy
About the Databricks Academy
Our mission at the Databricks Academy is to help our customers achieve their big
data and analytics goals through engaging learning experiences. At Databricks,
professionals from a wide variety of disciplines come together and use modern
pedagogical techniques to develop training that showcases Databricks best
practices. We offer our customers a wide range of materials to meet their diverse
training needs, whether they want to study at home, participate in a traditional
classroom setting, or engage with other Databricks users in public online courses,
so they can grow professionally with cloud-native skills.
About this course catalog
This course catalog is broken into the following categories:
● Welcome to the Databricks Academy: information about the Databricks
Academy and the students we serve
● What’s new this quarter: a list of the recently released training materials
● Databricks Academy offerings: an explanation of the types of learning
content we offer
● Course descriptions: short descriptions for each course available through
the Databricks Academy
What’s new this quarter
November 2021
ELT with Spark SQL
How to Ingest Data for Databricks SQL
Introduction to Databricks Data Science & Engineering Workspace
Scaling Machine Learning Pipelines
December 2021
Basic SQL on Databricks SQL
Getting Started with Databricks SQL
Introduction to MLflow Tracking
New Capability Overview: Feature Store
What’s being retired this quarter
November 9: Quick Reference: Databricks Workspace User Interface (content updated and
now included in Introduction to Databricks Data Science & Engineering Workspace)
November 9: Quick Reference: Managing Databricks Notebooks with the Databricks
Workspace (content updated and now included in Introduction to Databricks Data Science &
Engineering Workspace)
December 22: How to Manage Clusters in Databricks (content now included in Introduction
to Databricks Data Science & Engineering Workspace)
Databricks Academy offerings
Training
Self-paced online courses - asynchronous virtual training available to
individuals through the Databricks Academy website. This type of training is
free for Databricks customers. Each course is typically less than one hour in
length.
Workshops - live one- to three-hour training sessions offered to small
groups, typically in a virtual format. Please contact your Databricks
Customer Success Engineer for more information on workshops.
Instructor-led training - live one- to three-day training sessions available to
everyone (customers and the public) for a fee. These training offerings are
delivered virtually and on-site.
Credentials
Accreditations - low-stakes credentials earned by passing an unproctored
online exam administered through the Databricks Academy website. They
demonstrate mastery of technology areas at the introductory level and align
with self-paced training.
Certifications - higher-stakes credentials earned by passing a proctored exam
administered through a testing vendor. They demonstrate mastery of
intermediate and advanced technical areas and align with instructor-led
training. Unlike accreditations, which are prepared for a general audience,
certifications are role-based and designed to align with data practitioner
roles (for example, a data engineer or a data analyst role).
Learning paths
The learning paths included below are designed to help guide users to the courses
most relevant to them.
● Databricks Overview
● Platform administration
● Data analysis
● Data science / machine learning
● Data engineering
Course descriptions
Apache Spark Programming with
Databricks
Type: Instructor-led ($1,500 USD); Self-paced: Free for customers
Duration: 2 days (12 hours)
Course description: This course uses a case study-driven approach to explore the
fundamentals of Spark Programming with Databricks, including Spark architecture,
the DataFrame API, query optimization, and Structured Streaming. First, you will
become familiar with Databricks and Spark, recognize their major components, and
explore datasets for the case study using the Databricks environment. After
ingesting data from various file formats, you will process and analyze datasets by
applying a variety of DataFrame transformations, Column expressions, and built-in
functions. Lastly, you will execute streaming queries to process streaming data and
highlight the advantages of using Delta Lake.
Prerequisites:
● Familiarity with basic SQL concepts (select, filter, group by, join, etc.)
● Beginner programming experience with Python or Scala (syntax, conditions,
loops, functions)
Learning objectives:
● Identify core features of Spark and Databricks.
● Describe how DataFrames are created and evaluated in Spark.
● Apply DataFrame transformations to process and analyze data.
● Apply Structured Streaming to process streaming data.
● Explain fundamental Delta Lake concepts.
Course agenda:
● Day 1: DataFrames
○ Introduction
○ Databricks Platform
○ Spark SQL
○ Reader & Writer
○ DataFrame & Column
● Day 2: Transformations
○ Aggregation
○ Datetimes
○ Complex Types
○ Additional Functions
○ User-Defined Functions
● Day 3: Spark Optimization
○ Spark Architecture
○ Shuffles and Caching
○ Query Optimization
○ Spark UI
○ Partitioning
● Day 4: Structured Streaming
○ Review
○ Streaming Query
○ Processing Streams
○ Aggregating Streams
○ Delta Lake
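For readers who want a concrete feel for the DataFrame work described above, here is a minimal PySpark sketch (not part of the course materials). The dataset path and column names are illustrative placeholders, and it assumes a Databricks notebook where spark and display are predefined.

```python
from pyspark.sql import functions as F

# Hypothetical event data; the path and column names are placeholders.
events = spark.read.json("/databricks-datasets/example/events.json")

daily_revenue = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))  # built-in function
    .filter(F.col("event_type") == "purchase")                # Column expression
    .groupBy("event_date")
    .agg(
        F.sum("price").alias("revenue"),
        F.countDistinct("user_id").alias("buyers"),
    )
    .orderBy("event_date")
)

display(daily_revenue)  # Databricks notebook display
```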
Applications of SQL on Databricks
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In the course Introduction to SQL on Databricks, we introduce
Spark and Spark SQL as a solution for using common SQL syntax when working with
structured or semi-structured data. In this course, you will use Spark SQL on
Databricks to practice common design patterns for efficiently creating new tables
and explore built-in functions that can help you examine, manipulate, and aggregate
nested data.
Prerequisites & Requirements
● Prerequisites
○ Basic SQL commands
○ Experience working with SQL in a Databricks notebook
Learning objectives
● Use optional arguments in CREATE TABLE to define data format and location
in a Databricks database
● Efficiently copy, modify and create new tables from existing ones
● Use built-in functions and features of Spark SQL to manage and manipulate
nested objects
● Use roll-up, cube, and window functions to aggregate data and pivot tables
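As a rough illustration of these objectives, the sketch below (run here from a Python notebook cell, though the statements work equally well in SQL cells) creates a table with explicit format and location options and aggregates with ROLLUP. The table names, paths, and columns are hypothetical placeholders, not the course's datasets.

```python
# Hypothetical table, path, and column names; assumes a Databricks notebook.

# CREATE TABLE with optional arguments defining the data format and location.
spark.sql("""
  CREATE OR REPLACE TABLE sales_bronze
  USING DELTA
  LOCATION '/mnt/demo/sales_bronze'
  AS SELECT * FROM parquet.`/mnt/demo/raw/sales/`
""")

# ROLLUP produces per-region, per-country, and grand-total aggregates in one query.
display(spark.sql("""
  SELECT region, country, SUM(amount) AS total_amount
  FROM sales_bronze
  GROUP BY ROLLUP (region, country)
  ORDER BY region, country
"""))
```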
Applied Data Engineering Solutions with
Databricks
Type: Instructor-led ($1,500 USD); Self-paced: Free for customers
Duration: 2 days (12 hours)
Course description: In this course, formerly known as Advanced Data Engineering
with Databricks, participants will learn about advanced topics in building and
maintaining data engineering workloads on Databricks. This course highlights those
features of the Databricks Lakehouse platform that make it well-suited for
production data engineering, with an emphasis on Spark 3, Delta Lake, Structured
Streaming, and proprietary platform features.
Prerequisites:
● Advanced experience using Apache Spark
● Advanced experience coding with Python
● Intermediate experience writing SQL queries
● Intermediate experience using Databricks platform
● Intermediate experience using Delta Lake
● Intermediate experience using Structured Streaming
● Intermediate knowledge of data engineering concepts
Learning objectives:
● Design and implement multi-pipeline multi-hop architecture to enable the
Lakehouse paradigm.
● Deploy Structured Streaming operations that take advantage of Databricks
Job scheduling capabilities and avoid workspace limitations.
● Implement Databricks-native code that leverages platform-specific Delta
Lake features to simplify production workloads.
● Master design patterns that enable common use cases, including change data
capture (CDC), slowly changing dimensions (SCD), and managing personally
identifiable information (PII).
AWS Databricks Cloud Architecture and
System Integration Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: While the Databricks Unified Analytics Platform provides a
broad range of functionality to many members of data teams, it is through
integrations with other services that most cloud-native applications will achieve
results desired by customers. This course is a series of demos designed to help
students understand the portions of cloud workloads appropriate for Databricks.
Within these demos, we'll highlight integrations with first-party services in the AWS
cloud to build scalable and secure applications.
Prerequisites:
● Beginning knowledge of Spark programming (reading/writing data, batch and
streaming jobs, transformations and actions)
● Beginning-level experience using Python or Scala to perform basic control
flow operations.
● Familiarity with navigation and resource configuration in the AWS Console.
Learning objectives:
● Describe use cases for Databricks in an enterprise cloud architecture.
● Configure secure connections from Databricks to data in S3.
● Configure connections between Databricks and various first-party tools in an
enterprise cloud architecture, including Redshift and Kinesis.
● Deploy an MLflow model to a Sagemaker endpoint for serving online model
predictions.
● Configure Glue as an enterprise data catalog.
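As a small, hedged example related to the second objective, the snippet below reads data from S3 in two common ways: relying on an instance profile attached to the cluster, or pulling keys from a Databricks secret scope. The bucket, prefix, scope, and key names are placeholders.

```python
# Assumes a Databricks notebook; bucket, prefix, scope, and key names are placeholders.

# Option 1: the cluster's instance profile (IAM role) grants read access to the bucket.
df = spark.read.json("s3a://example-bucket/landing/events/")
df.printSchema()

# Option 2: supply credentials from a secret scope instead of hard-coding keys.
spark.conf.set("fs.s3a.access.key", dbutils.secrets.get(scope="aws", key="access-key"))
spark.conf.set("fs.s3a.secret.key", dbutils.secrets.get(scope="aws", key="secret-key"))
```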
AWS Databricks Cluster Usage
Management
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: In this course, you will first define computation resources
(clusters, jobs, and pools) and determine which resources to use for different
workloads. Then, you will learn cluster provisioning strategies for several use cases to
maximize usability and cost-effectiveness. You will also identify best practices for
cluster governance, including cluster policies. This course also covers capacity
limits, cost management, and chargeback analysis.
Prerequisites:
● Beginning experience using the Databricks workspace
● Beginning experience with Databricks administration
Learning objectives:
● Define computation resources (clusters, jobs, and pools) and determine
which resources to use for different workloads.
● Describe cluster provisioning strategies for several use cases to maximize
usability and cost effectiveness.
● Identify best practices for cluster governance, including cluster policies.
● Describe capacity limits on AWS Databricks.
● Describe how to manage costs and perform chargeback analysis.
AWS Databricks Data Access
Management
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will learn about the Databricks File System
and Hive Metastore concepts. Then, you will apply best practices to secure access
to Amazon S3 from Databricks. Next, you will configure access control for data
objects including tables, databases, views, and functions. You will also apply column
and row-level permissions and data masking with dynamic views for multiple users
and groups. Lastly, you will identify methods for data isolation within your
organization on Databricks.
Prerequisites:
● Beginning experience with AWS Databricks security, including deployment
architecture and encryptions
● Beginning experience with AWS Databricks administration, including identity
management and workspace access control
● Beginning experience using the Databricks workspace
● Databricks Premium Plan
Learning objectives:
● Describe fundamental concepts about the Databricks File System and Hive
Metastore.
● Apply best practices to secure access to Amazon S3 from Databricks.
● Configure access control for data objects including tables, databases, views,
and functions.
● Apply column and row-level permissions and data masking with dynamic
views for multiple users and groups.
● Identify methods for data isolation within your organization on Databricks.
AWS Databricks Identity Access
Management
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: In this course, you will learn how to manage user accounts and
groups in the Admin Console. You will also learn how to manage token-based
authentication and settings for your workspace, such as workspace storage and
additional cluster configurations. Lastly, this course covers access control for
workspace objects, such as notebooks and folders, in addition to clusters, pools, and
jobs.
Prerequisites:
● Experience using a web browser.
● Note: To perform the tasks shown in this course, you will need a Databricks
workspace deployment with administrator rights.
Learning objectives:
● Manage user accounts and groups in the Admin Console.
● Generate and manage personal access tokens for authentication.
● Enable additional cluster configurations and purge deleted objects from
workspace storage.
● Configure access control for workspace objects, such as notebooks and
folders.
● Configure access control for clusters, pools, and jobs.
AWS Databricks Security Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: This course provides an overview of essential security features
to consider when managing your AWS Databricks workspace. You will start by
identifying components of the Databricks platform architecture and deployment
models. Then, you will define several features regarding network security including
no public IPs, Bring Your Own VPC, VPC peering, and IP access lists. After recognizing
IdP integrations, you will explore access control configurations for different
workspace assets. You will then identify encryptions and permissions available for
data protection, such as IdP authentication, secrets, and table access control. Lastly,
you will describe security standards and configurations for compliance, including
cluster policies, Bring Your Own Key, and audit logs.
Prerequisites:
● Beginning-level knowledge of basic AWS cloud computing terms (e.g., S3, VPC,
IAM)
● Beginning-level knowledge of basic Databricks concepts (e.g., workspace,
clusters, notebooks)
Learning objectives:
● Describe components of the AWS Databricks platform architecture and
deployment model.
● Explain network security features including no public IP address, Bring Your
Own VPC, VPC peering, and IP access lists.
● Describe identity provider integrations and access control configurations for
an AWS Databricks workspace.
● Explain encryptions and permissions available for data protection, such as
identity provider authentication, secrets, and table access control.
● Describe security standards and configurations for compliance, including
cluster policies, Bring Your Own Key, and audit logs.
AWS Databricks SQL Administration
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will learn how to set up and configure access
to the Databricks SQL Analytics user interface. The administrative tasks in this
course will be done using the Databricks Workspace and Databricks SQL Analytics
UI, and will not include instruction for API access. By the end of this course, you will
be able to set up computational resources for users, grant and revoke access to
specific data, manage users and groups, and set up alert destinations.
Prerequisites:
● Intermediate knowledge of Databricks
● Databricks account on the Premium plan (with SQL Analytics enabled)
● Administrator credentials to your organization’s Databricks Workspace
Learning objectives:
● Describe how Databricks SQL Analytics is used by data practitioners.
● Manage user and group access to Databricks SQL Analytics.
● Configure and monitor SQL Endpoints to maximize performance, control
costs, and track usage on Databricks SQL Analytics.
● Set up access to data storage through SQL endpoints or external data stores
in order for users to access data on Databricks SQL Analytics.
● Control user access to data objects (e.g. tables, databases, and views) by
programmatically setting privileges for specific users and/or groups on
Databricks SQL Analytics.
● Create and configure Databricks SQL Analytics alert destinations for users.
AWS Databricks Workspace Deployment
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: This course will walk you through setting up your Databricks
account including setting up billing, configuring your AWS account, and adding users
with appropriate permissions. At the end of this course, you'll find guidance and
resources for additional setup options and best practices.
Prerequisites:
● Experience using a web browser.
● Note: To follow along with this course, you will need access to a Databricks
account with Account Owner permissions.
Learning objectives:
● Access the Databricks account console and set up billing.
● Configure an AWS account using a cross-account role or access keys.
● Configure AWS storage and deploy the Databricks workspace.
● Add users and assign admin or cluster creation rights.
● Identify resources for additional setup options and best practices.
Azure Databricks Cloud Architecture and
System Integration Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: While the Databricks Unified Analytics Platform provides a
broad range of functionality to many members of data teams, it is through
integrations with other services that most cloud-native applications will achieve
results desired by customers. This course is designed to help students understand
the portions of cloud workloads appropriate for Databricks, and highlight
integrations with first-party services in the Azure cloud to build scalable and secure
applications.
Prerequisites:
● Beginning knowledge of Spark programming (reading/writing data, batch and
streaming jobs, transformations and actions)
● Beginning-level experience using Python or Scala to perform basic control
flow operations.
● Familiarity with navigation and resource configuration in the Azure Portal.
Learning objectives:
● Describe use-cases for Azure Databricks in an enterprise cloud architecture.
● Configure secure connections to data in an Azure storage account.
● Configure connections from Databricks to various first-party tools, including
Synapse, Key Vault, Event Hubs, and CosmosDB.
● Configure Azure Data Factory to trigger production jobs on Databricks.
● Trigger CI/CD workloads on Databricks assets using Azure DevOps.
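For a concrete, if simplified, picture of the second objective, the sketch below configures access to an Azure Data Lake Storage account with an account key held in a secret scope and reads a Parquet dataset. The storage account, container, scope, and paths are hypothetical.

```python
# Assumes a Databricks notebook; all names below are placeholders.
storage_account = "examplestorage"
account_key = dbutils.secrets.get(scope="azure", key="storage-account-key")

# Register the account key with Spark so abfss:// paths can be resolved.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key,
)

df = spark.read.parquet(
    f"abfss://landing@{storage_account}.dfs.core.windows.net/events/"
)
display(df.limit(10))
```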
Azure Databricks Cluster Usage
Management
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will first define computation resources
(clusters, jobs, and pools) and determine which resources to use for different
workloads. Then, you will learn cluster provisioning strategies for several use cases to
maximize usability and cost effectiveness. You will also identify best practices for
cluster governance, including cluster policies. This course also covers capacity
limits, cost management, and chargeback analysis.
Prerequisites:
● Beginning experience with the Databricks workspace UI
● Beginning experience with Databricks administration
Learning objectives:
● Define computation resources (clusters, jobs, and pools) and determine
which resources to use for different workloads.
● Describe cluster provisioning strategies for several use cases to maximize
usability and cost effectiveness.
● Identify best practices for cluster governance, including cluster policies.
● Describe capacity limits on Azure Databricks.
● Describe how to manage costs and perform chargeback analysis.
Azure Databricks Data Access
Management
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will learn about the Databricks File System
and Hive Metastore concepts. Then, you will apply best practices to secure access
to Azure data storage from Azure Databricks. Next, you will configure access control
for data objects including tables, databases, views, and functions. You will also apply
column and row-level permissions and data masking with dynamic views for
multiple users and groups. Lastly, you will identify methods for data isolation within
your organization on Azure Databricks.
Prerequisites:
● Beginning experience with Azure Databricks security, including deployment
architecture and encryptions
● Beginning experience with Azure Databricks administration, including identity
management and workspace access control
● Beginning experience using the Azure Databricks workspace
● Azure Databricks Premium Plan
Learning objectives:
● Describe the Databricks File System and Hive Metastore concepts.
● Apply best practices to secure access to Azure data storage from Azure
Databricks.
● Configure access control for data objects including tables, databases, views,
and functions.
● Apply column and row-level permissions and data masking with dynamic
views for multiple users and groups.
● Identify methods for data isolation within your organization on Azure
Databricks.
Azure Databricks Identity Access
Management
Type: Self-paced; Cost: Free for customers
Duration: 45 minutes
Course description: In this course, you will learn how to manage user accounts and
groups in the Admin Console. You will also learn how to manage token-based
authentication and settings for your workspace, such as workspace storage and
additional cluster configurations. Lastly, this course covers access control for
workspace objects, such as notebooks and folders, in addition to clusters, pools, and
jobs.
Prerequisites:
● Beginning experience using the Databricks workspace.
Learning objectives:
● Manage user accounts and groups in the Admin Console.
● Generate and manage personal access tokens for authentication.
● Enable additional cluster configurations and purge deleted objects from
workspace storage.
● Configure access control for workspace objects, such as notebooks and
folders.
● Configure access control for clusters, pools, and jobs.
Azure Databricks Security Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: This course provides an overview of essential security features
to consider when managing your Azure Databricks workspace. You will start by
identifying components of the Azure Databricks platform architecture and
deployment model. Then, you will define several features regarding network security
including no public IPs, Bring Your Own VNET, VNET peering, and IP access lists. After
recognizing IdP and AAD integrations, you will explore access control configurations
for different workspace assets. You will then identify encryptions and permissions
available for data protection, such as IdP authentication, secrets, and table access
control. Lastly, you will describe security standards and configurations for
compliance, including cluster policies, Bring Your Own Key, and audit logs.
Prerequisites:
● Beginning-level knowledge of basic Azure cloud computing terms (e.g., Blob
storage, ADLS, VNET, Azure Active Directory)
● Beginning-level knowledge of basic Databricks concepts (e.g., workspace,
clusters, notebooks)
Learning objectives:
● Describe components of the Azure Databricks platform architecture and
deployment model.
● Explain network security features including no public IP address, Bring Your
Own VNET, VNET peering, and IP access lists.
● Describe identity provider and Azure Active Directory integrations and access
control configurations for an Azure Databricks workspace.
● Explain encryptions and permissions available for data protection, such as
identity provider authentication, secrets, and table access control.
● Describe security standards and configurations for compliance, including
cluster policies, Bring Your Own Key, and audit logs.
Azure Databricks SQL Administration
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will learn how to set up and configure access
to the Databricks SQL Analytics user interface. The administrative tasks in this
course will be done using the Databricks Workspace and Databricks SQL Analytics
UI, and will not include instruction for API access. By the end of this course, you will
be able to set up computational resources for users, grant and revoke access to
specific data, manage users and groups, and set up alert destinations.
Prerequisites:
● Intermediate knowledge of Databricks
● Databricks account on the Premium plan (with SQL Analytics enabled)
● Administrator credentials to your organization’s Databricks Workspace
Learning objectives:
● Describe how Databricks SQL Analytics is used by data practitioners.
● Manage user and group access to Databricks SQL Analytics.
● Configure and monitor SQL Endpoints to maximize performance, control
costs, and track usage on Databricks SQL Analytics.
● Set up access to data storage through SQL endpoints or external data stores
in order for users to access data on Databricks SQL Analytics.
● Control user access to data objects (e.g. tables, databases, and views) by
programmatically setting privileges for specific users and/or groups on
Databricks SQL Analytics.
● Create and configure Databricks SQL Analytics alert destinations for users.
Azure Databricks Workspace Deployment
Type: Self-paced; Cost: Free for customers
Duration: 10 minutes
Course description: In this course, you will identify the prerequisites for creating an
Azure Databricks workspace, deploy an Azure Databricks workspace in the Azure
portal, launch the workspace, and access the Admin Console.
Prerequisites:
● To complete the actions outlined in this course, you must have access to an
Azure subscription.
Learning objectives:
● Identify prerequisites for launching an Azure Databricks workspace.
● Deploy an Azure Databricks workspace in the Azure portal.
● Launch the deployed Azure Databricks workspace.
● Access the Admin Console in the deployed Azure Databricks workspace.
Certification Prep Course for the
Databricks Certified Associate Developer
for Apache Spark Exam
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: Prepare to take the Databricks Certified Associate Developer for
Apache Spark Exam. This course will cover the format and structure of the exam,
skills needed for the exam, tips for exam preparation, and the parts of the
DataFrame API and Spark architecture covered in the exam.
Prerequisites:
● Describe the basics of the Apache Spark architecture.
● Perform basic data transformations using the Apache Spark DataFrame API
using Python or Scala.
● Perform basic data input and output using the Apache Spark DataFrame API
using Python or Scala.
● Perform custom data actions using user-defined functions using Python or
Scala.
● Perform data transformations using Spark SQL.
● Note: While the above skills are not necessary for this course, the course will
be far more helpful in preparing students if they have these skills.
Learning objectives:
● Summarize the learning context behind the Databricks Certified Associate
Developer for Apache Spark exam.
● Describe the topics covered in the Databricks Certified Associate Developer
for Apache Spark exam.
● Describe the format and structure of the Databricks Certified Associate
Developer for Apache Spark exam.
● Apply practical test-taking strategies to answer example questions similar to
those of the Databricks Certified Associate Developer for Apache Spark
exam.
● Identify resources that can be used to learn the material covered in the
Databricks Certified Associate Developer for Apache Spark exam.
Configuring Workspace Access Control
Lists (ACLs)
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Databricks has extensive access control lists (ACLs) for
workspace assets to help administrators restrict and grant access to appropriate
users. This course includes a set of instructions and caveats for configuring many of
these settings, as well as a video walkthrough showing this configuration and the
resultant user experience.
Prerequisites:
● Basic knowledge of the Databricks workspace
Learning objectives:
● Manage permissions for groups of users.
● Control access to notebooks and folders.
● Restrict access for cluster creation and editing.
● Change ownership of configured jobs.
Data Engineering with Databricks
Type: Instructor-led ($1,000 USD); Self-paced: Free for customers
Duration: 2 days (12 hours)
Course description: This course begins with a review of programming with Spark
APIs and an introduction to key terms and definitions of Databricks data engineering
tools, followed by an overview of DB Connect, the Spark UI, and writing testable
code. Participants will learn about the Cloud Data Platform in terms of data
architecture concepts and will build an end-to-end OLAP data pipeline using Delta
Lake with batch and streaming data, learning best practices throughout. Participants
who wish to dive deeper into tuning and optimization can take the Advanced Data
Engineering with Databricks course.
Prerequisites:
● Intermediate to advanced programming skills in Python or Scala
● Intermediate to advanced SQL skills
● Beginning experience using the Spark DataFrames API
● Beginning knowledge of general data engineering concepts
● Beginning knowledge of the core features and use cases of Delta Lake
Learning objectives:
● Build an end-to-end batch and streaming OLAP data pipeline.
● Make data available for consumption by downstream stakeholders using
specified design patterns.
● Apply Databricks' recommended best practices in engineering a single source
of truth Delta architecture.
Course agenda:
● Note: There are approximately 13 hours of content in this course, divided into
four parts (Part 1, Part 2, Part 3, Part 4).
○ Two-day deliveries will go through Parts 1 and 2 on Day 1 and Parts 3
and 4 on Day 2.
○ Four-day deliveries will go through one part per day.
● Part 1: Course Welcome and Setup
○ Introduction
○ The Big Picture
○ Software Engineering
○ Configuration and Utilities
○ Planning your Data Pipeline
○ Engineering a Data Pipeline
● Part 2: Streaming Delta Tables
○ Ingesting Raw Data
○ Raw to Bronze
○ Delta Table Versioning
○ Bronze to Silver
○ Silver Update
○ The Query Layer
○ Silver to Gold
● Part 3: Batch Delta Tables
○ Planning your Data Pipeline
○ Configuration and Utilities
○ Ingest - Raw
○ Raw to Bronze
○ Bronze to Silver
○ Silver Update
○ Delta Architecture
○ Schema Enforcement and Evolution
● Part 4: Compliance and Optimization
○ GDPR & CCPA Compliance
○ Normalization
○ SCD and CDC
○ Delta Engine Optimizations
Data Science on Databricks Rapid Start
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: This course will provide an overview of the features and
functionality within the Unified Data Analytics Platform that enable data
practitioners to follow data science and machine learning workflows. Aside from an
overview of features and functionality, the course will provide learners with
hands-on experience using the Unified Data Analytics Platform to execute basic
tasks and solve a real-world problem.
Prerequisites:
● Beginning experience with Python as applied to data science and analysis.
● Beginning experience with popular data science tools such as pandas and
charting libraries
● Beginning experience working with notebooks (not necessarily Databricks
notebooks)
Learning objectives:
● Summarize Databricks functionality that enables data practitioners to work
with data through the data science workflow.
● Summarize Databricks functionality that enables data practitioners to run
machine learning experiments on data.
● Solve a given real-world problem by executing basic data science tasks in the
Unified Data Analytics Platform.
Data Science on Databricks -
Bias-Variance Tradeoff
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, we’ll show you how to use scikit-learn on
Databricks, along with some core statistical and data science principles, to select a
family of machine learning models for deployment.
This course is the first in a series of three courses developed to show you how to
use Databricks to work with a single data set from experimentation to
production-scale machine learning model deployment. The other courses in this
series include:
● Tracking Experiments with MLflow
● Deploying a Machine Learning Project with MLflow Projects
Prerequisites:
● Beginning-level experience running data science workflows in the Databricks
Workspace
● Beginner-level experience with Apache Spark
● Intermediate-level experience with the SciPy numerical stack
Learning objectives:
● Create and explore an aggregate sample from user event data.
● Design an MLflow experiment to estimate model bias and variance.
● Use exploratory data analysis and estimated model bias and variance to
select a family of models for model development.
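The following sketch, using synthetic data rather than the course's user-event sample, shows one way an MLflow experiment can compare train and validation error across model complexities, which is the kind of evidence used to reason about bias and variance. The model family, parameter values, and names are illustrative assumptions only.

```python
import mlflow
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; purely illustrative, not the course dataset.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Sweep model complexity; a widening gap between train and validation error
# suggests the model family is moving from high bias toward high variance.
for depth in [1, 3, 5, 10, 20]:
    with mlflow.start_run(run_name=f"tree_depth_{depth}"):
        model = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
        mlflow.log_param("max_depth", depth)
        mlflow.log_metric("train_mse", mean_squared_error(y_train, model.predict(X_train)))
        mlflow.log_metric("val_mse", mean_squared_error(y_val, model.predict(X_val)))
```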
Data Visualization with Databricks SQL
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will learn how to use Databricks SQL, an
integrated SQL editing and dashboarding tool, from your Databricks workspace.
Working with Databricks SQL allows you to easily query your data lake, or other data
sources, and build dashboards that can be easily shared across your organization.
You will learn how to parameterize queries so that users can easily modify
dashboard views to target specific results. Also, we will make use of alerts for
ongoing monitoring so that you can be notified when certain events occur or when
particular attributes of a data set reach a certain threshold.
Prerequisites:
● Access to the Databricks SQL interface
● Intermediate experience using the Databricks platform
● Intermediate experience with SQL
● Intermediate experience with data analysis concepts
Learning objectives:
● Describe how you can use SQL from your Databricks workspace.
● Execute queries and create visualizations using Databricks SQL.
● Write parameterized queries so that users can easily customize their results
and visualizations.
● Create and share dashboards that hold a collection of visualizations.
Databases, Tables, and Views on
Databricks
Type: Self-paced; Cost: Free for customers
Duration: 35 minutes
Course description: In this short course, you’ll learn how to create databases, tables,
and views on Databricks. Special attention will be given to differences in scope and
persistence for these various entities, allowing any user responsible for creating or
managing databases, tables, or views to make informed decisions for their
organization. While the syntax for creating and working with databases, tables, and
views will be familiar to most SQL users, some default behaviors may surprise users
new to Databricks.
Prerequisites:
● Beginning knowledge of SQL
● Beginning knowledge of loading and interacting with sample data from
Databricks.
● Beginning knowledge of using Databricks notebooks
Learning objectives:
● Describe persistence and scope of databases, tables, and views on
Databricks.
● Compare and contrast the behavior of managed and unmanaged tables.
● Summarize best practices for creating and managing databases, tables, and
views on Databricks.
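The sketch below hints at the scope and persistence differences the course explores: a managed table whose files Databricks controls, an external (unmanaged) table registered against an explicit location, and a session-scoped temporary view. The database, table, and path names are hypothetical.

```python
# Assumes a Databricks notebook; all names and paths are placeholders.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")

# Managed table: Databricks manages the files; dropping the table deletes the data.
spark.sql("""
  CREATE OR REPLACE TABLE demo_db.managed_events
  USING DELTA
  AS SELECT * FROM parquet.`/databricks-datasets/example/events/`
""")

# External (unmanaged) table: data lives at the given location; dropping the
# table removes only the metastore entry, not the underlying files.
spark.sql("""
  CREATE OR REPLACE TABLE demo_db.external_events
  USING DELTA
  LOCATION '/mnt/demo/external_events'
  AS SELECT * FROM demo_db.managed_events
""")

# Temporary view: scoped to the current Spark session only.
spark.table("demo_db.managed_events").createOrReplaceTempView("events_vw")
```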
Databricks Command Line Interface (CLI)
Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 45 minutes
Course description: While the Databricks platform web-based graphical user
interface provides powerful functionality for data teams, many use cases call for
programmatic command line access. The Databricks command line interface (CLI)
provides access to a variety of powerful workspace features. This module is not
intended as a comprehensive overview of all the CLI can do, but rather an
introduction to some of the common features users may desire to leverage in their
workloads.
Prerequisites:
● Familiarity with Apache Spark concepts
● Familiarity with the data engineering capabilities of the Databricks Platform
● Intermediate experience using the Databricks platform for data engineering
(creating clusters, loading notebooks, scheduling jobs, etc.)
Learning objectives:
● Install and configure the Databricks CLI to securely interact with the
Databricks Workspace.
● Configure workspace secrets using the CLI for more secure sharing and use of
string-based credentials in notebooks.
● Sync notebooks and libraries between the Databricks workspace and other
environments using the CLI.
● Perform a variety of tasks including interacting with clusters, jobs, and runs
using the CLI.
Databricks Datadog Integration
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Datadog provides customizable integration scripts and
dashboards to integrate your Databricks logs into your larger monitoring ecosystem.
This lesson goes through basic configuration, as well as extending this configuration
to add additional security and custom tagging.
Prerequisites:
● Basic familiarity with the Databricks workspace
● Basic familiarity with cluster configuration
Learning objectives:
● Configure both ends of the Databricks Datadog integration.
● Add custom variables to your monitored clusters.
● Use Databricks secrets to redact API tokens.
Databricks on Google Cloud: Architecture
and Security Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: This course dives into the platform architecture and key security
features of Databricks on Google Cloud. You will start with an overview of Databricks
on Google Cloud and how it integrates with the Google Cloud ecosystem. Then, you
will define core components of the platform architecture and deployment model on
Databricks on Google Cloud. You will also learn about key security features to
consider when provisioning and managing workspaces, as well as guidelines on
network security, identity and access management, and data protection.
Prerequisites & Requirements
● Prerequisites
○ Basic familiarity with Databricks concepts (workspace, notebooks,
clusters, DBFS, etc)
○ Basic familiarity with Google Cloud concepts (projects, IAM, GCS, VPC,
subnets, GKE, etc)
Learning objectives
● Describe how Databricks integrates with the Google Cloud ecosystem.
● Identify components of the Databricks on Google Cloud platform architecture
and deployment model.
● Recognize best practices for network security when deploying workspaces.
● Describe identity management and access control features in Databricks on
Google Cloud.
● Identify storage locations and data protection features in Databricks on
Google Cloud.
Databricks on Google Cloud: Cloud
Architecture and System Integration
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: This course is a series of demos designed to help students
understand the portions of cloud workloads appropriate for Databricks. Within these
demos, we'll highlight integrations with first-party services in Google Cloud to build
scalable and secure applications.
Prerequisites & Requirements
● Prerequisites
○ Familiarity with the Databricks on Google Cloud workspace
○ Beginning knowledge of Spark programming (reading/writing data,
batch and streaming jobs, transformations and actions)
○ Beginning-level experience using Python or Scala to perform basic
control flow operations.
○ Familiarity with navigation and resource configuration in the Databricks
on Google Cloud Console.
Learning objectives
● Describe where Databricks fits into a cloud-based architecture on Google
Cloud.
● Authenticate to Google Cloud resources with service account credentials.
● Read and write data to Cloud Storage using Databricks secrets.
● Mount a GCS bucket to DBFS using cluster-wide service accounts.
● Configure a cluster to read and write data to BigQuery using credentials in
DBFS.
Databricks on Google Cloud: Cluster
Usage Management
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course covers essential cluster configuration features and
provisioning guidelines for Databricks on Google Cloud. In this course, you will start
by defining core computation resources (clusters, jobs, and pools) and determine
which resources to use for different workloads. Then, you will learn cluster
provisioning strategies for several use cases to maximize manageability. Lastly, you
will learn how to manage cluster usage and cost for your Databricks on Google Cloud
workspaces.
Prerequisites & Requirements
● Prerequisites
○ Beginning experience using the Databricks workspace
○ Beginning experience with Databricks administration
Learning objectives
● Describe the core computation resources in Databricks: clusters, jobs, and
pools.
● Recognize best practices for configuring cluster resources for different
workloads.
● Identify cluster provisioning use cases and strategies for manageability.
● Describe how to manage cluster usage and cost for Databricks on Google
Cloud.
Databricks on Google Cloud: Workspace
Deployment
Type: Self-paced; Cost: Free for customers
Duration: 20 minutes
Course description: This is a short course that shows new customers how to set up
a Databricks account and deploy a workspace on Google Cloud. This will cover
accessing the Account Console and adding account admins, provisioning and
accessing workspaces, and adding users and admins to a workspace.
Prerequisites & Requirements
● Prerequisites
○ Basic familiarity with Databricks concepts (Databricks account,
workspace, DBFS, etc)
○ Basic familiarity with Google Cloud concepts (Cloud console, project,
GCS, IAM, VPC, etc)
Learning objectives
● Access the Databricks Account Console.
● Add account admins in the Account Console.
● Provision and access a Databricks workspace.
● Access the Admin Console for a Databricks workspace.
● Add workspace users and admins in the Admin Console.
Databricks with R
Type: Self-paced; Cost: Free for customers
Duration: 7 hours
Course description: In this seven-hour course, you will analyze clickstream data from
an imaginary mattress retailer called Bedbricks. In this case study, you'll explore the
fundamentals of Spark Programming with R on Databricks, including Spark
architecture, the DataFrame API, and Machine Learning.
Prerequisites & Requirements
● Prerequisites
○ Beginning experience working with R.
Learning objectives
● Identify core features of Spark and Databricks.
● Describe how DataFrames are created and evaluated in Spark.
● Apply the DataFrame transformation API to process and analyze data.
Databricks Workspace Fundamentals for
Business Analytics
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: This course will showcase Databricks Workspace (Workspace)
functionality useful to business/SQL analysts. It will begin with a review of
vocabulary terms helpful for working in the Workspace and include simple workflows
that data practitioners can use to quickly query and analyze data.
Prerequisites:
● Beginning experience with SQL is helpful (examples will be done using SQL)
● Beginning knowledge about what Databricks is and what it does
Learning objectives:
● Summarize fundamental concepts for using Databricks effectively.
● Explain how data professionals can use Databricks to extract meaning from
data.
● Use Databricks to follow a traditional data workflow to extract meaning from
data.
Delta Lake Rapid Start with Python
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: Apache Spark™ is the dominant processing framework for big
data. Delta Lake is a robust storage solution designed specifically to work with
Apache Spark™. It adds reliability to Spark so your analytics and machine learning
initiatives have ready access to quality, reliable data. Delta Lake makes data lakes
easier to work with and more robust. It is designed to address many of the problems
commonly found with data lakes. This course covers the basics of working with
Delta Lake, specifically with Python, on Databricks.
Prerequisites:
● Beginning level experience using Databricks to upload and visualize data
● Intermediate level experience using Apache Spark including the CTAS pattern
and use of popular pyspark.sql functions
● Beginning level knowledge of Delta Lake
Learning objectives:
● Use Delta Lake to create a new Delta table.
● Convert an existing Parquet-based data lake table into a Delta table.
● Differentiate between a batch update and an upsert to a Delta table.
● Use Delta Lake Time Travel to view different versions of a Delta table.
● Execute a MERGE command to upsert data into a Delta table.
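To make the last objective concrete, here is a minimal sketch of an upsert using the Delta Lake Python API; the paths, table layout, and join key are hypothetical and not taken from the course.

```python
from delta.tables import DeltaTable

# Hypothetical paths and columns; assumes a Databricks notebook.
updates_df = spark.read.json("/mnt/demo/customer_updates/")
target = DeltaTable.forPath(spark, "/mnt/demo/delta/customers")

# MERGE (upsert): update rows that match on the key, insert the rest.
(target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```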
Delta Lake Rapid Start with Spark SQL
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: Apache Spark™ is the dominant processing framework for big
data. Delta Lake is a robust storage solution designed specifically to work with
Apache Spark™. It adds reliability to Spark so your analytics and machine learning
initiatives have ready access to quality, reliable data. Delta Lake makes data lakes
easier to work with and more robust. It is designed to address many of the problems
commonly found with data lakes. This course covers the basics of working with
Delta Lake, specifically with Spark SQL, on Databricks.
Prerequisites:
● How to upload data into a Databricks Workspace
● How to visualize data using Databricks
● Intermediate level Spark SQL usage including the CTAS pattern, use of Spark
SQL functions such as from_unixtime, lag, lead, and partitioning.
Learning objectives:
● Use Delta Lake to create a new Delta table and to convert an existing
Parquet-based data lake table
● Differentiate between a batch append and an upsert to a Delta table
● Use Delta Lake Time Travel to view different versions of a Delta table
● Execute a MERGE command to upsert data into a Delta table
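As a brief illustration of Time Travel, the statements below (shown through spark.sql, though they can be run directly in SQL cells) inspect a table's history and query earlier versions; the table name and version/timestamp values are placeholders.

```python
# Hypothetical table name; assumes a Databricks notebook and an existing Delta table.

# Inspect the table's commit history (versions, timestamps, operations).
display(spark.sql("DESCRIBE HISTORY customers"))

# Time Travel: read an earlier version of the table by version number...
previous = spark.sql("SELECT * FROM customers VERSION AS OF 1")

# ...or by timestamp.
as_of_ts = spark.sql("SELECT * FROM customers TIMESTAMP AS OF '2021-11-01'")
```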
Deploying a Machine Learning Project with
MLflow Projects
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: In this course, we’ll show you how to train and deploy a large
scale machine learning model using MLflow and Apache Spark. This course is the
third in a series of three courses developed to show you how to use Databricks to
work with a single data set from experimentation to production-scale machine
learning model deployment. We recommend taking the first two courses in this
series before continuing with this course:
● Data Science on Databricks - Bias-Variance Tradeoff
● Tracking Experiments with MLflow
Prerequisites:
● Beginning-level experience running data science workflows in the Databricks
Workspace
● Beginner-level experience with Apache Spark
● Intermediate-level experience with the SciPy numerical stack
● Intermediate-level experience with the command line
Learning objectives:
● Summarize Databricks best practices for deploying machine learning projects
with MLflow.
● Explain local development strategies for writing software with Databricks.
● Use Databricks to write production-grade machine learning software.
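For orientation, the snippet below shows roughly how an MLflow Project run can be launched from Python; the project directory, entry point, parameter, and experiment path are hypothetical placeholders rather than what the course actually uses.

```python
import mlflow

# Assumes the current directory (or a Git URI) contains an MLproject file that
# defines a "main" entry point accepting an "alpha" parameter; all names are placeholders.
submitted = mlflow.run(
    uri=".",
    entry_point="main",
    parameters={"alpha": 0.5},
    experiment_name="/Shared/demo-project",
)
print(submitted.run_id, submitted.get_status())
```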
ELT with Spark SQL
Type: Self-paced; Cost: Free for customers
Duration: 2.5 hours
Course description: This course teaches experienced SQL analysts and engineers
how to complete common ELT tasks using Spark SQL on Databricks. Students will
extract data from multiple data sources, load data into Delta Lake tables, and apply
data quality checks and transformations. Students will also learn how to leverage
existing tables in a Lakehouse for last-mile ETL to support dashboards and
reporting.
Prerequisites:
● Students should be able to navigate the Databricks workspace (creating and
loading notebooks, connecting to clusters)
● Students should have intermediate fluency in SQL
● Students should be familiar with relational entities on Databricks
● Students should be familiar with the high-level architecture of the Lakehouse
Learning objectives:
● Extract data from a variety of common data sources using Spark SQL in the
Databricks Data Science and Engineering workspace
● Load data into Delta Lake tables using the Databricks Data Science and
Engineering workspace
● Apply transformations to complete common cleaning tasks and data quality
checks using the Databricks Data Science and Engineering workspace
● Reshape datasets with advanced functions to derive analytical insights using
the Databricks Data Science and Engineering workspace
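A compressed, hypothetical sketch of the extract-and-load pattern this course teaches appears below: querying raw files in place, loading them into a Delta Lake table, and applying a simple quality filter on the way to a cleaned table. Paths, table names, and columns are placeholders, and the same statements can be run in SQL cells.

```python
# Assumes a Databricks notebook; paths, tables, and columns are placeholders.

# Extract and load: query JSON files in place, then materialize a bronze Delta table.
spark.sql("""
  CREATE OR REPLACE TABLE orders_bronze
  USING DELTA
  AS SELECT * FROM json.`/mnt/demo/raw/orders/`
""")

# Transform with a basic quality check: keep only well-formed, non-negative orders.
spark.sql("""
  CREATE OR REPLACE TABLE orders_silver
  USING DELTA
  AS SELECT order_id,
            customer_id,
            CAST(order_ts AS TIMESTAMP) AS order_ts,
            amount
     FROM orders_bronze
     WHERE order_id IS NOT NULL AND amount >= 0
""")
```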
Easy ETL with Auto Loader
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Databricks Auto Loader is the preferred method for ingesting
incremental data landing in cloud object storage into your Lakehouse. This course
introduces Auto Loader and demonstrates some of the newer features added to this
product. Included are recommended patterns for data ingestion with Auto Loader.
Prerequisites:
● Basic experience with Spark APIs
● Basic knowledge of Delta Lake
● Basic experience with Structured Streaming
Learning objectives:
● Describe the basic functionality and features of Auto Loader.
● Use Auto Loader to ingest data to Delta Lake without losing data.
● Configure automatic schema detection and evolution.
● Rescue unexpected data arriving in well-structured datasets.
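The sketch below gives a rough sense of what an Auto Loader ingestion stream looks like; the paths, the options shown, and the choice of a run-once trigger are illustrative assumptions rather than the course's exact recommendations.

```python
# Assumes a Databricks notebook; all paths are placeholders.
raw_path = "/mnt/demo/landing/events/"
schema_path = "/mnt/demo/schemas/events/"         # where inferred schemas are tracked
checkpoint_path = "/mnt/demo/checkpoints/events/"

# Auto Loader uses the "cloudFiles" streaming source to pick up new files incrementally.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)
    .load(raw_path)
)

(stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true")   # let the Delta sink evolve with new columns
    .trigger(once=True)              # process currently available files, then stop
    .start("/mnt/demo/bronze/events/"))
```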
Enterprise Architecture with Databricks
Type: Self-paced; Cost: Free for customers
Duration: 7 hours
Course description: In this course, you’ll learn how business leaders, admins, and
architects use Databricks in their architecture. We’ll cover fundamental concepts
about key roles (data engineers, data scientists, and platform administrators) and
raw data forms (structured and unstructured data, batch and streaming data) to set
the stage for a discussion of how end users help businesses create data assets like
machine learning models, reports, and dashboards. Then, we’ll discuss where
components of Azure Databricks fit into an organization’s big data ecosystem.
Finally, we’ll review real-world business use cases and create enterprise-level
architecture and infrastructure diagrams.
Prerequisites:
● Beginning knowledge of the characteristics that define big data (the three Vs:
velocity, volume, and variety)
● Beginning knowledge of how organizations process and manage big data
(relational/SQL vs. NoSQL, cloud vs. on-premises, open-source databases vs.
closed-source database-as-a-service offerings)
● Beginning knowledge about the roles that data practitioners play on data
science teams (can distinguish between database administrators and data
scientists, data analysts and machine learning engineers, data engineers and
platform administrators)
Learning objectives:
● Create a requirements document which profiles the data needs of an
organization.
● Translate business needs related to data analytics into technical
requirements used for drawing an architectural diagram.
● Translate the Databricks Lakehouse Architecture with Delta to a technical
requirements document.
● Design Azure Databricks architectures that include integration with Azure
services for real-world scenarios.
● Evaluate, analyze, and validate detailed infrastructure designs.
● Create infrastructure designs.
Fundamentals of Big Data
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: This course was created for individuals who are new to the big
data landscape and want to become conversant with big data terminology. It will
cover foundational concepts related to the big data landscape including:
characteristics of big data; the relationship between big data, artificial intelligence,
and data science; how individuals on data science teams work with big data; and
how organizations can use big data to enable better business decisions.
Prerequisites:
● Experience using a web browser
Learning objectives:
● Explain foundational concepts used to define big data.
● Explain how the characteristics of big data have changed traditional
organizational workflows for working with data.
● Summarize how individuals on data science teams work with big data on a
daily basis to drive business outcomes.
● Articulate examples of real-world use-cases for big data in businesses across
a variety of industries.
Fundamentals of Cloud Computing
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This introductory-level course is designed to familiarize
individuals new to the cloud computing landscape. It will cover foundational
concepts related to cloud computing starting with the basics - what cloud
computing is and why, since 2011, over 30% of organizations have moved their
operations to the cloud. The course will also cover topics like cloud delivery models
and deployment types.
Please note that this course is about cloud computing in general and does not
focus on Databricks specifically.
Prerequisites:
● Experience using a web browser
Learning objectives:
● Summarize foundational concepts about cloud computing.
● Describe major cloud computing components.
● Explain the three major cloud computing delivery models.
● Explain the three major cloud computing deployment models.
● Outline the benefits of moving an organization’s operations to the cloud.
Fundamentals of Databricks Machine
Learning
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Databricks Machine Learning offers data scientists and other
machine learning practitioners a platform for completing and managing the
end-to-end machine learning lifecycle. This course guides business leaders and
practitioners through a basic overview of Databricks Machine Learning, the benefits
of using Databricks Machine Learning, its fundamental components and
functionalities, and examples of successful customer use.
Prerequisites:
● Beginning-level knowledge of the Databricks Lakehouse platform
Learning objectives:
● Give a basic overview of Databricks Machine Learning.
● Identify how using Databricks Machine Learning benefits data science and
machine learning teams.
● Summarize the fundamental components and functionalities of Databricks
Machine Learning.
● Give examples of successful Databricks Machine Learning use cases from real
Databricks customers.
Fundamentals of Databricks SQL
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Databricks SQL offers SQL users a platform for querying,
analyzing, and visualizing data in their organization’s Lakehouse. This course explains
how Databricks SQL processes queries and guides users through how to use the
interface. Then, this course will explain how you can connect Databricks SQL to
your favorite business intelligence tool, so that you can query your Lakehouse
without making changes to your analytical and dashboarding workflows.
Prerequisites:
● None.
Learning objectives:
● Summarize fundamental concepts for using Databricks SQL effectively.
● Identify tools and features in Databricks SQL for querying and analyzing data
as well as sharing insights with the larger organization.
● Explain how Databricks SQL supports data analysis workflows that allow users
to extract and share business insights.
Fundamentals of Delta Lake
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Delta Lake is an open format storage layer that sits on top of
your organization’s data lake. It is the foundation of a cost-effective, highly scalable
Lakehouse and is an integral part of the Databricks Lakehouse Platform.
In this course, we’ll break down the basics behind Delta Lake - what it does, how it
works, and why it is valuable from a business perspective, to any organization with
big data and AI projects.
Note: This is an introductory-level course that will *not* showcase in-depth
technical Delta Lake demos nor provide hands-on technical training with Delta Lake.
Please see the Delta Lake Rapid Start courses available in the Databricks Academy
for technical training on Delta Lake.
Prerequisites:
● Beginning knowledge of the Databricks Lakehouse Platform. We recommend
taking the course Fundamentals of the Databricks Lakehouse Platform prior to
taking this course.
Learning objectives:
● Describe how Delta Lake fits into the Databricks Lakehouse Platform.
● Explain the four elements encompassed by Delta Lake.
● Summarize high-level Delta Lake functionality that helps organizations solve
common challenges related to enterprise-scale data analytics.
● Articulate examples of how organizations have employed Delta Lake on
Databricks to improve business outcomes.
Fundamentals of Enterprise Data
Management Systems
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Whether your organization is moving to the cloud for the first
time or reevaluating its current approach, making decisions about the technology
used when storing your data can have huge implications for costs and performance
in downstream analytics. As a platform focused on computation and analytics,
Databricks seeks to help our customers make choices that unlock new
opportunities, reduce redundancies, and connect data teams. In this course, you’ll
start by exploring the characteristics of data lakes and data warehouses, two
popular data storage technologies. Then, you’ll learn about the Lakehouse, a new
data storage system invented and made popular by Databricks.
Prerequisites:
● Beginning knowledge about the Databricks Unified Data Analytics Platform.
● We recommend taking the courses: Fundamentals of Big Data and
Fundamentals of Unified Data Analytics with Databricks prior to taking this
course.
Learning objectives:
● Describe the strengths and limitations of data lakes, related to data storage.
● Describe the strengths and limitations of data warehouses, related to data
storage.
● Contrast data lake and data warehouse characteristics.
● Compare the features of a Lakehouse to the features of popular data storage
management solutions.
Fundamentals of Lakehouse Architecture
Type: Self-paced; Cost: Free for customers
Duration: 10 minutes
Course description: In this ten-minute video, you will learn about the Lakehouse, a
new data management architectural pattern that offers state-of-the-art support
and performance for data science, machine learning, and business analytics
applications. Furthermore, we will explore how Delta Lake plays an integral part in
creating a Lakehouse.
Prerequisites:
● None
Learning objectives:
● Identify analytics as a key component for data-driven decision making.
● Describe differences between common types of data storage solutions.
● Define the term "lakehouse" with respect to data management systems.
● Cite types of optimizations you can use with Delta Engine.
Fundamentals of Machine Learning
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course you’ll learn fundamental concepts about machine
learning. First, we’ll review machine learning basics - what it is, why it’s used, and
how it relates to data science. Then, we’ll explore the two primary categories that
machine learning problems are categorized into - supervised and unsupervised
learning. Finally, we’ll review how the machine learning workflow fits into the data
science process.
Prerequisites & Requirements
● Prerequisites
○ Beginning knowledge about concepts related to the big data landscape
helpful but not required (i.e. big data types, analysis techniques,
processing techniques, etc.)
○ We recommend taking the Databricks Academy course "Introduction to
Big Data" before taking this course.
Learning objectives
● Explain how machine learning is used as an analysis tool in data science.
● Summarize the relationship between the data science process and the
machine learning workflow.
● Describe the two primary categories that machine learning problems are
categorized into.
● Describe popular machine learning techniques within the two primary
categories of machine learning.
● Determine the machine learning technique that should be used to analyze
data in a given real-world scenario.
Fundamentals of Structured Streaming
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: A common struggle that organizations face is how to accurately
ingest and perform calculations on real-time data. This data is also referred to as
streaming data, and the challenges behind working with it lie in its real-time nature -
because it is constantly arriving, mechanisms must be put into place to process it
and write it to a data store. In this course, you’ll learn about Structured Streaming, an
Apache Spark API that helps data practitioners overcome the challenges of working
with streaming data. We’ll cover fundamental concepts about batch and streaming
data to help set the stage for our discussion on Structured Streaming. Then, we’ll
discuss where Structured Streaming fits into an organization’s big data ecosystem.
Finally, we’ll review real-world Structured Streaming business use cases.
Prerequisites & Requirements
● Prerequisites
○ Beginning knowledge about the Databricks Unified Data Analytics
Platform (what it is, what it is used for)
○ Beginning knowledge about concepts related to the big data landscape
(for example: structured streaming, batch processing, data pipelines)
○ Note: We recommend taking the following two Databricks Academy
courses to help you prepare for this course: Fundamentals of Big Data
and Fundamentals of Unified Data Analytics with Databricks.
Learning objectives
● Explain the benefits of Structured Streaming for working with streaming data.
● Distinguish where Structured Streaming fits into an organization’s big data
ecosystem.
● Articulate examples of real-world business use cases for Structured
Streaming.
Fundamentals of the Databricks
Lakehouse Platform
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course is designed for everyone who is brand new to the
Platform and wants to learn more about what it is, why it was developed, what it
does, and the components that make it up.
Our goal is that by the time you finish this course, you’ll have a better understanding
of the Platform in general and be able to answer questions like: What is Databricks?
Where does Databricks fit into my workflow? How have other customers been
successful with Databricks?
NOTE: THIS COURSE DOES NOT CONTAIN HANDS-ON PRACTICE WITH THE
DATABRICKS LAKEHOUSE PLATFORM. IT WAS DEVELOPED AS A PREREQUISITE TO
OTHER COURSES WHICH DO HAVE HANDS-ON COMPONENTS.
Prerequisites:
● Experience using a web browser.
Learning objectives:
● Describe what the Databricks Lakehouse Platform is.
● Explain the origins of the Lakehouse data management paradigm.
● Outline fundamental problems that cause most enterprises to struggle with
managing and making use of their data.
● Identify the most popular components of the Databricks Lakehouse Platform
used by data practitioners, depending on their unique role.
● Give examples of organizations that have used the Databricks Lakehouse
Platform to streamline big data processing and analytics.
● Describe security features that come built-in to the Databricks Lakehouse
Platform.
Google Cloud Fundamentals
Type: Self-paced; Cost: Free for customers
Duration: 1.5 hours
Course description: Learn the basics of Google Cloud and how to configure various
resources using the Cloud Console. This course begins with an overview of the
platform, key terminology, and core services. You will then learn essential IAM
concepts and how service accounts can be used to manage resources. You will also
learn about the function and use cases of several storage services, such as Cloud
Storage, Cloud SQL, and BigQuery. This course also covers virtual machine and
networking concepts in Compute Engine and VPC services. The course ends with an
overview of GKE clusters and Kubernetes concepts.
Prerequisites:
● Familiarity with basic cloud computing concepts (cloud computing, cloud
storage, virtual machine, database, data warehouse)
Learning objectives:
● Define basic concepts and core services in the Google Cloud Platform.
● Describe IAM concepts and how service accounts can be used to manage
resources.
● Identify use cases for storage services, such as Cloud Storage, Cloud SQL,
and BigQuery.
● Define virtual machine and networking concepts in Compute Engine and VPC
services.
● Describe Google Kubernetes Engine and the core components of Kubernetes
clusters.
How to Code-Along with Self-Paced
Courses
Type: Self-paced; Cost: Free for customers
Duration: 3 minutes
Course description: Some Databricks Academy courses are designed with hands-on
coding exercises that you can follow along with. This short video guides you
through navigating these courses and demonstrates the recommended workflow for
completing the coding exercises.
Prerequisites:
● Beginning knowledge about the Databricks Collaborative Data Science
Workspace environment
● Beginning knowledge about uploading data to the Collaborative Data Science
Workspace
Learning objectives:
● Describe how to complete coding exercises in code-based self-paced
courses.
How to Manage Clusters in Databricks
Type: Self-paced; Cost: Free for customers
Duration: 15 minutes
Course description: In this course, you’ll learn a series of skills for working with and
configuring clusters in the Databricks Collaborative Data Science Workspace
(Workspace) including exploring cluster functions and creating, displaying, cloning,
editing, pinning, terminating, and deleting a cluster.
Prerequisites:
● Beginning knowledge about the Databricks Collaborative Data Science
Workspace environment
Learning objectives:
● Use the Databricks Workspace to perform a variety of cluster management
tasks.
Introduction to Apache Spark
Architecture
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course, you will explore how Apache Spark executes a
series of queries. Examples will include simple narrow transformations and more
complex wide transformations.
This course will give developers a working understanding of how to write code that
leverages the power of Apache Spark for even the simplest of queries.
Prerequisites:
● Familiarity with basic information about Apache Spark (what it is, what it is
used for)
Learning objectives:
● Explain how Apache Spark applications are divided into jobs, stages, and
tasks.
● Explain the major components of Apache Spark's distributed architecture.
Introduction to Applied Linear Models
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Linear modeling is a popular starting point for machine learning
studies for a number of reasons. Generally, these models are relatively easy to
interpret and explain, and they can be applied to a broad range of problems. In this
course, you will learn how to choose, apply, and evaluate commonly used linear
modeling techniques. As you work through the course, you can put your new skills to
practice in 5 hands-on labs.
Prerequisites & Requirements
● Prerequisites
○ Intermediate experience with machine learning (experience using
machine learning and data science libraries like scikit-learn and
Pandas, knowledge of linear models).
○ Intermediate experience using the Databricks Workspace to perform
data analysis (using Spark DataFrames, Databricks notebooks, etc.).
○ Beginning experience with statistical concepts commonly used in data
science.
Learning objectives
● Describe and evaluate linear regression for regression problems.
● Describe how to ensure machine learning models generalize to out-of-sample
data.
● Describe and evaluate logistic regression for classification problems.
● Practice linear modeling techniques using the Databricks Data Science
Workspace.
Introduction to Applied Statistics
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: In this course you’ll learn, both in theory and in practice, about
statistical techniques that are fundamental to many data science projects.
Throughout the course, videos will guide you through the conceptual information
you need to know about these statistical concepts, and hands-on lab activities will
give you the chance to apply the concepts you learn using the Databricks
Workspace. This course is divided into three modules: Introduction to Statistics and
Probability, Probability Distributions, and Applying Statistics to Learn from Data.
Prerequisites & Requirements
● Prerequisites
○ Beginning experience using the Databricks Data Science Workspace
(familiarity with Spark SQL, experience importing files into the
Databricks Data Science Workspace)
○ Beginning experience using Python (ability to follow guided use of the
SciPy library)
Learning objectives
● Contrast descriptive statistics and inferential statistics.
● Explain fundamental concepts behind discrete probability.
● Compare and contrast discrete and continuous probability distributions.
● Explain how discrete and continuous probability distributions can be used to
model data.
● Apply hypothesis testing techniques to learn from data.
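As a small illustration of the hypothesis-testing topic, here is a sketch using SciPy on synthetic data; it is not taken from the course labs.

# A minimal sketch: two-sample t-test on synthetic data with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=10.0, scale=2.0, size=100)   # sample from one distribution
group_b = rng.normal(loc=10.8, scale=2.0, size=100)   # sample with a slightly shifted mean

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # test whether the means differ
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")         # a small p-value suggests a real difference in means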
Introduction to Applied Tree-Based
Models
Type: Self-paced; Cost: Free for customers
Duration: 3 hours
Course description: In this course, you’ll learn how to solve complex supervised
learning problems using tree-based models. First, we’ll explain how decision trees
can be used to identify complex relationships in data. Then, we’ll show you how to
develop a random forest model to build upon decision trees and improve model
generalization. Finally, we’ll introduce you to various techniques that you can use to
account for class imbalances in a dataset. Throughout the course, you’ll have the
opportunity to practice concepts learned in hands-on labs.
Prerequisites:
● Intermediate level knowledge about machine learning/machine learning
workflows (feature engineering and selection, applying tree-based models,
etc.)
● We recommend that you take the following courses prior to taking this
course: Fundamentals of Machine Learning, Introduction to Feature
Engineering and Selection with Databricks, Applied Unsupervised Learning
with Databricks.
Learning objectives:
● Describe how decision trees are used to solve supervised learning problems.
● Identify complex relationships in data using decision trees.
● Develop a random forest model to build upon decision trees and improve
model generalization.
● Employ common techniques to account for class imbalances in a dataset.
Introduction to Applied Unsupervised
Learning
Type: Self-paced; Cost: Free for customers
Duration: 3 hours
Course description: In this course, we will describe and demonstrate how to learn
from data using unsupervised learning techniques during exploratory data analysis.
The course is divided into two sections: one focuses on K-means
clustering and the other describes principal components analysis, commonly
referred to as PCA. Each section includes demonstrations of important concepts, a
quiz to solidify your understanding, and a lab to practice your skills.
Prerequisites:
● Intermediate experience with machine learning (experience using machine
learning and data science libraries like scikit-learn and Pandas, knowledge of
linear models)
● Intermediate experience using the Databricks Workspace to perform data
analysis (using Spark DataFrames, Databricks notebooks, etc.)
● Beginning experience with machine learning concepts.
Learning objectives:
● Identify relationships between data records using K-means clustering.
● Identify patterns in a high-dimensional feature space using principal
components analysis.
● Learn from data using unsupervised learning techniques during exploratory
data analysis.
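For orientation, here is a minimal scikit-learn sketch of the two techniques the course covers; the synthetic dataset and parameters are illustrative only.

# A minimal sketch: K-means clustering and PCA on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # group similar records
X_2d = PCA(n_components=2).fit_transform(X)                              # project onto 2 principal components for exploration
print(labels[:10], X_2d.shape)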
Introduction to AutoML
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course serves as a quick introduction to Databricks AutoML.
With AutoML, ML experts can accelerate their workflows by fast-forwarding through
the usual trial and error and focusing on customizations using their domain knowledge,
and citizen data scientists can quickly achieve usable results with a low-code
approach.
Prerequisites:
● Familiarity with Databricks Machine Learning.
Learning objectives:
● Explain what AutoML is.
● Describe how AutoML fits into the Databricks ecosystem.
● Articulate how to use AutoML for appropriate use cases.
● Demonstrate how to use AutoML for given use cases.
Introduction to Cloning with Delta Lake
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: The addition of clone to Delta Lake empowers data engineers
and administrators to easily replicate data stored in the Lakehouse. Organizations
can use deep clone to archive versions of their production tables for regulatory
compliance. Developers can easily create development datasets isolated from
production data with shallow clone. In this course, you’ll learn the basics of cloning
with Delta Lake and get hands-on experience working with the syntax.
Prerequisites:
● Hands-on experience working with Delta Lake
● Intermediate experience with Spark and Databricks
Learning objectives:
● Describe the basic execution of deep and shallow clones with Delta Lake.
● Use deep clones to create full incremental backups of tables.
● Use shallow clones to create development datasets.
● Describe strengths, limitations, and caveats of each type of clone.
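As a taste of the syntax covered, here is a minimal sketch run from a Databricks notebook; the table names are hypothetical.

# A minimal sketch: deep and shallow clones of a Delta table (names are hypothetical).
spark.sql("CREATE TABLE IF NOT EXISTS prod_events_archive DEEP CLONE prod_events")   # full, independent copy of the data files
spark.sql("CREATE TABLE IF NOT EXISTS dev_events SHALLOW CLONE prod_events")         # metadata-only copy that references the source files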
Introduction to Databricks Connect
Type: Self-paced; Cost: Free for customers
Duration: 40 minutes
Course description: In this course, participants will be introduced to DB Connect
through various presentations and demos. Participants will start by contrasting how
DB Connect works with other development patterns. Then we will explore how simply
DB Connect can be installed and configured. Finally, we will
conclude with a real-time demonstration of an application running on a developer’s
local machine while executing its Spark jobs against a cluster in the Databricks
workspace.
Prerequisites:
● Intermediate experience using the Databricks Workspace
Learning objectives:
● Explain how Databricks Connect is used by data practitioners working with
Databricks.
● Install and configure Databricks Connect.
Introduction to Databricks Machine
Learning
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: Databricks Machine Learning offers data scientists and other
machine learning practitioners a platform for completing and managing the
end-to-end machine learning lifecycle. This course guides practitioners through a
basic machine learning workflow using Databricks Machine Learning. Along the way,
students will learn how each of Databricks Machine Learning’s features better enable
data scientists and machine learning engineers to complete their work effectively
and efficiently.
Prerequisites:
● Beginning-level knowledge of the Databricks Lakehouse platform
● Intermediate-level knowledge of Python
● Intermediate-level knowledge of machine learning workflows
Learning objectives:
● Give a basic overview of Databricks Machine Learning.
● Create a feature table for downstream modeling using Feature Store.
● Automatically develop a baseline model using AutoML.
● Manage the model lifecycle using Model Registry.
● Perform batch inference using the registered model and feature table.
● Schedule a monthly model refresh using Databricks Jobs and AutoML.
Introduction to Databricks Repos
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Repos aims to make Databricks simple to use by giving data
scientists and engineers the familiar tools of git repositories and file systems. These
tools enable a more laptop-like developer experience for customers. Repos is the
new, top-level, customer-facing feature that packages these tools together in the
Databricks user interface. This course teaches how to get started with Repos.
Prerequisites:
● Familiarity with Git and Git commands
● Familiarity with Databricks workspaces
Learning objectives:
● Describe the motivations for Databricks Repos.
● Configure workspace integration with Github.
● Sync local and remote notebook changes using Repos.
Introduction to Delta Lake
Type: Self-paced; Cost: Free for customers
Duration: 1 hour, 15 minutes
Course description: Delta Lake is a powerful tool created by Databricks. Delta Lake
is an open, reliable, performant and secure data storage and management layer for
your data lake that enables you to create a true single source of truth. Since it is
built upon Apache Spark, you’re able to build high performance data pipelines to
clean your data from raw ingestion to business level aggregates. Finally, given the
open format - it allows you to avoid unnecessary replication and proprietary lock-in.
Ultimately - it provides the reliability, performance, and security you need to serve
your downstream data use cases.
Prerequisites:
● Intermediate SQL skills (e.g. can do CRUD statements in SQL)
● Beginner experience with working on Databricks in the Data Science &
Engineering workspace or the Machine Learning workspace (e.g. can import
DBC files, can access workspaces). Also note: although this course relies
heavily on SQL as a language, this is not intended for learners who primarily
use the Databricks SQL workspace products.
● Beginner experience with working with data pipelines is helpful
Learning objectives:
● Describe the basic features and technical implementation of Delta Lake.
● Ingest data and manage Delta tables to keep data complete, up-to-date, and
organized.
● Optimize Delta performance using common strategies.
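To illustrate the kinds of operations covered, here is a minimal sketch, assuming a Databricks notebook where spark is predefined; the table name is hypothetical.

# A minimal sketch: create a Delta table, update it transactionally, and time travel.
spark.range(100).withColumnRenamed("id", "user_id") \
     .write.format("delta").mode("overwrite").saveAsTable("bronze_users")

spark.sql("UPDATE bronze_users SET user_id = user_id + 1000 WHERE user_id < 10")   # ACID update

spark.sql("DESCRIBE HISTORY bronze_users").show(truncate=False)                    # inspect the transaction log
spark.sql("SELECT * FROM bronze_users VERSION AS OF 0").show()                     # time travel to the first version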
Introduction to Delta Live Tables
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Delta Live Tables enables data teams to innovate rapidly with
simple development, using declarative tools to build and manage batch or streaming
data pipelines. Built-in quality controls and data quality monitoring ensure accurate
and useful BI, Data Science, and ML built on top of quality data. Delta Live Tables is
designed to scale with rapidly growing companies and provides clear observability
into pipeline operations and automatic error handling. This course will cover the
basics of this new product, including syntax, configuration, and deployment.
Prerequisites:
● Beginner experience working with PySpark or Spark SQL
● Basic familiarity with the Databricks workspace
Learning objectives:
● Describe the motivations for Delta Live Tables.
● Use PySpark or SQL syntax to declare Delta Live Tables.
● Schedule and deploy pipelines with the Databricks UI.
● Review pipeline logs and metrics.
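For a flavor of the declarative syntax, here is a minimal Python sketch of a two-table pipeline; it only runs inside a Delta Live Tables pipeline on Databricks, and the source path and expectation are illustrative.

# A minimal sketch of a Delta Live Tables pipeline (Python syntax).
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events ingested from cloud storage")
def bronze_events():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/tmp/landing/events"))          # illustrative path

@dlt.table(comment="Cleaned events")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # built-in data quality expectation
def silver_events():
    return dlt.read_stream("bronze_events").withColumn("ingested_at", F.current_timestamp())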
Introduction to Feature Engineering and
Selection with Databricks
Type: Self-paced; Cost: Free for customers
Duration: 2.5 hours
Course description: As data practitioners work on supervised machine learning
solutions, they often need to manipulate data to ensure that it is compatible with
machine learning algorithm requirements and the model is meeting its objective.
This process is known as feature engineering, and the end goal is to improve the
output of machine learning solutions. Once features are engineered, data
practitioners also commonly need to determine the best features to use in their
machine learning projects.
In this course, you’ll learn how to perform both of these tasks. This course is divided
into two modules - in the first, you’ll explore feature engineering. In the second, you’ll
explore feature selection. Both modules will start with an introduction to these
topics - what they are and why they’re used. Then, you’ll review techniques that help
data practitioners perform these tasks. Finally, you’ll have the chance to perform two
hands-on lab activities - one where you will engineer features and another where
you will select features for a fictional machine learning scenario.
Prerequisites:
● Intermediate experience with machine learning (experience using machine
learning and data science libraries like scikit-learn and Pandas, knowledge of
linear models)
● Intermediate experience using the Databricks Workspace to perform data
analysis (using Spark DataFrames, Databricks notebooks, etc.)
● Beginning experience with statistical concepts commonly used in data
science
Learning objectives:
● Explain popular feature engineering techniques used to improve supervised
machine learning solutions.
● Explain popular feature selection techniques used to improve supervised
machine learning solutions.
● Engineer meaningful features for use in a supervised machine learning project
using the Databricks Data Science Workspace.
● Select meaningful features for use in a supervised machine learning project
using the Databricks Data Science Workspace.
Introduction to Files in Databricks Repos
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course teaches how to add non-notebook files to
Databricks Repos. Learners will connect a Databricks workspace to a hosted Git
repository. Next, they will import and store non-DBC and non-notebook files using
Databricks Repos. Then, they will import a markdown file and sync changes between
a Databricks Repo and a Git provider.
Prerequisites:
● Familiarity with Git and Git commands
● Familiarity with Databricks workspaces
● Completion of the Introduction to Databricks Repos course
Learning objectives:
● Connect a Databricks workspace to a hosted Git repository using Databricks
Repos
● Import and store non-DBC and non-notebook files using already-configured
Databricks Repos with a Git provider
● Import a markdown file into the workspace
● Sync changes within Databricks to a Git provider
Introduction to Hyperparameter
Optimization
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: In this course, you’ll learn how to apply hyperparameter tuning
strategies to optimize machine learning models for unseen data. First, you’ll work
within a balanced binary classification problem setting where you’ll use random
forest to predict the correct class. You’ll learn to tune the hyperparameters of a
random forest to improve a model. Then, you’ll again work with a binary classification
problem using random forest and a technique known as cross-validation to
generalize a model.
Prerequisites:
● Intermediate level knowledge about machine learning/machine learning
workflows (feature engineering and selection, applying tree-based models,
etc.)
● We recommend that you take the following courses prior to taking this
course: Fundamentals of Machine Learning, Introduction to Feature
Engineering and Selection with Databricks, Introduction to Applied
Tree-based Models with Databricks.
Learning objectives:
● Explain common machine learning techniques that are used to optimize
machine learning models for unseen data.
● Apply machine learning techniques to improve the fit of machine learning
models.
● Apply machine learning techniques to improve the generalization of machine
learning models.
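As an illustration of the general idea (not the course’s own notebooks), here is a minimal scikit-learn sketch that tunes a random forest with cross-validation; the dataset and parameter grid are synthetic.

# A minimal sketch: tune a random forest with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},  # hyperparameters to search
    cv=5,                      # 5-fold cross-validation to estimate generalization
    scoring="f1")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)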
Introduction to Jobs
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Databricks Jobs allow users to run applications in a
non-interactive way on a cluster. Jobs allow users to manage and orchestrate
production tasks, making it simple to promote notebooks from interactive
development to scheduled workloads. In this course, you’ll explore various features
of the Jobs UI as you orchestrate a simple pipeline.
Prerequisites:
● Intermediate knowledge of Python or SQL
● Beginning knowledge of software development principles (e.g. code
modularity, code scheduling, code orchestration)
● Beginning knowledge of navigating Databricks UI
Learning objectives:
● Describe jobs and motivations for using jobs in the workflow of data
practitioners.
● Create single task jobs with a scheduled trigger.
● Orchestrate multiple notebook tasks with the Jobs UI.
● Discuss common use cases and patterns for Jobs.
Introduction to MLflow Model Registry
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course will introduce you to MLflow Model Registry. Model
Registry is a centralized model management tool that allows you to track metrics,
parameters, and artifacts as part of experiments, package models and reproducible
ML projects, and deploy models to batch or real-time serving platforms. You will
learn how your team can use Model Registry as a central place to share ML models,
collaborate on moving them from experimentation to testing and production, and
implement approval and governance workflows.
Prerequisites:
● Beginner-level experience with machine learning.
● Beginner-level experience with MLflow Model Tracking.
● Beginner-level experience with Python.
● Beginner-level experience with Apache Spark on Databricks.
Learning objectives:
● Describe the components and functionalities of Model Registry.
● Explain the benefits of using Model Registry for machine learning model
management.
● Describe how Model Registry fits into the ML lifecycle with Databricks
Machine Learning.
● Demonstrate how to use Model Registry to perform essential tasks in the ML
workflow.
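To sketch the workflow in code, the following assumes MLflow tracking and a model registry are available (as they are on Databricks); the registered model name and the scikit-learn model are illustrative.

# A minimal sketch: log a model, register it, and promote it to Staging.
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.sklearn.log_model(model, artifact_path="model")   # track the fitted model as a run artifact

# Register the logged model and move the new version to the Staging stage.
result = mlflow.register_model(f"runs:/{run.info.run_id}/model", "demo_classifier")
MlflowClient().transition_model_version_stage(
    name="demo_classifier", version=result.version, stage="Staging")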
Introduction to Multi-Task Jobs
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: After a recap of single-task jobs, as well as the directed acyclic
graph (DAG) model, you will learn how to create, trigger, or schedule Multi-Task Jobs
in Databricks.
Prerequisites:
● Experience working with the Databricks Workspace
Learning objectives:
● Explain what Multi-Task Jobs are.
● Describe how Multi-Task Jobs fits into the Databricks ecosystem.
● Articulate how to use Multi-Task Jobs for appropriate use cases.
● Demonstrate how to use Multi-Task Jobs.
Introduction to Natural Language
Processing
Type: Self-paced; Cost: Free for customers
Duration: 4 hours
Course description: This course will introduce you to natural language processing
with Databricks. You will learn how to generate
term-frequency-inverse-document-frequency (TFIDF) vectors for your datasets
and how to perform latent semantic analysis using the Databricks Machine Learning
Runtime.
Prerequisites:
● Intermediate experience performing machine learning/data science workflows
● Intermediate experience using the Databricks Data Science Workspace to
perform machine learning workflows
Learning objectives:
● Describe foundational concepts about how latent semantic analysis is used
to analyze text data.
● Perform latent semantic analysis using the Databricks Machine Learning
Runtime with the Databricks Workspace.
● Generate TFIDF vectors to reduce the noise in a dataset being used for latent
semantic analysis in a Databricks Workspace.
Introduction to Photon
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: In this course, you’ll learn how Photon can be used to reduce
Databricks total cost of ownership (TCO) and dramatically improve query
performance. You’ll also learn best practices for when to use and not use Photon.
Finally, the course will include a demonstration of a query run with and without
Photon to show the improvement in query performance.
Prerequisites:
● Administrator privileges
● Introductory knowledge about the Databricks Lakehouse Platform (what the
Databricks Lakehouse Platform is, what it does, main components, etc.)
Learning objectives:
● Explain fundamental concepts about Photon on Databricks.
● Describe the benefits of enabling Photon on Databricks.
● Identify queries that would benefit from using Photon.
● Describe the performance differences between a query run with and without
Photon enabled.
Introduction to SQL on Databricks
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Databricks, a managed platform for running Apache Spark,
provides a premier environment for processing SQL workloads. Spark SQL is a Spark
module for structured data processing. It can act as a distributed SQL query engine,
enabling queries to run up to 100x faster on existing deployments and data. Users
with a classical SQL background can immediately begin to work in the Databricks
SQL environment. Using Spark SQL on Databricks has multiple advantages over
using SQL with traditional tools.
Prerequisites & Requirements
● Prerequisites
○ Familiarity with SQL
Learning objectives
● Identify the benefits of using Spark SQL on Databricks
● Describe basic cluster computing concepts like parallelization
● Use Spark SQL on Databricks to run basic queries
● Explain how common functions and Databricks tools can be applied to
upload, view, and visualize data
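For a quick taste, here is a minimal sketch of running Spark SQL from a Databricks notebook (spark is predefined; the view and column names are made up).

# A minimal sketch: register a temporary view and query it with Spark SQL.
spark.range(1000).selectExpr("id", "id % 7 AS weekday").createOrReplaceTempView("visits")

top_days = spark.sql("""
    SELECT weekday, COUNT(*) AS visit_count
    FROM visits
    GROUP BY weekday
    ORDER BY visit_count DESC
""")
top_days.show()   # the query is distributed across the cluster's executors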
Just Enough Python for Apache Spark
Type: Premium self-paced (Free for customers); Instructor-led course: $1000 USD
Duration: 1 day (6 hours)
Course description: This 1-day course aims to help participants with or without a
programming background develop just enough experience with Python to begin
using the Apache Spark programming APIs.
Prerequisites:
● Some experience in a structured programming language such as Javascript,
C++, or R is helpful
Learning objectives:
● Navigate the Python documentation
● Employ basic programming constructs such as conditional statements and
loops
● Use functions and classes from existing libraries
● Create new functions and classes
● Identify and use the primary collection types
● Understand the breadth of the language's string functions (and other misc
utility functions)
● Employ basic exception handling
● Describe and possibly employ some of the key features of functional
programming
Course agenda:
● Part 1: Getting Started with Python
○ Key Python internet resources
○ How to run Python code in various environments
● Part 2: Variable and Data Types
○ Fundamental Python concepts
○ Introduction to 4 basic data types
○ Declare and assign variables
○ Employ simple, built-in functions
○ Develop and use assert statements
● Part 3: Conditionals and Loops
○ Create a simple list
○ Iterate over a list using a for expression
○ Conditionally execute statements using if, elif, and else expressions
● Part 4: Methods, Functions, and Packages
○ Develop and use functions with and without arguments and type hints
○ Use assert statements to “unit test” functions
○ Employ the help() function to learn about modules, functions, classes,
and keywords
○ Identify differences between functions and methods
○ Import libraries
● Part 5: Collections and Classes
○ Use list methods and syntax to append, remove, or replace elements of
a list
○ Compare ranges to lists
○ Define dictionaries
○ Use list and dictionary comprehensions to efficiently transform each
element of each data structure
○ Define classes and methods
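As a small taste of the constructs listed in the agenda, here is an illustrative snippet (not taken from the course materials).

# Variables, a function with type hints, assert-based checks, a loop, and a list comprehension.
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a Fahrenheit temperature to Celsius."""
    return (temp_f - 32) * 5 / 9

assert round(fahrenheit_to_celsius(212), 1) == 100.0   # a simple "unit test"

readings_f = [32, 68, 98.6, 212]
readings_c = [round(fahrenheit_to_celsius(t), 1) for t in readings_f]   # list comprehension

for f, c in zip(readings_f, readings_c):
    print(f"{f}F -> {c}C")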
Just Enough Scala for Apache Spark
Type: Instructor-led course: $1000 USD
Duration: 1 day (6 hours)
Course description: This 1-day course aims to help participants with or without a
programming background develop just enough experience with Scala to begin using
the Apache Spark programming APIs.
Prerequisites:
● Some experience in a structured programming language such as Javascript,
C++, or R is helpful
Learning objectives:
● Navigate the Scala documentation
● Employ basic programming constructs such as conditional statements and
loops
● Use functions and classes from existing libraries
● Create new functions and classes
● Identify and use the primary collection types
● Understand the breadth of the language's string functions (and other misc
utility functions)
● Employ basic exception handling
Course agenda:
● Coming soon.
Lakehouse with Delta Lake Deep Dive
Type: Self-paced; Cost: Free for customers
Duration: 3 hours
Course description: This course begins with an overview of the Lakehouse
architecture, and an in-depth look at key Delta Lake features and functionality that
make a Lakehouse possible. Participants will build end-to-end OLAP data pipelines
using Delta Lake for batch and streaming data. The course also discusses serving
data to end users through aggregate tables and Databricks SQL Analytics.
Throughout the course, emphasis will be placed on using data engineering best
practices with Databricks.
Prerequisites:
● Intermediate to advanced SQL skills
● Intermediate to advanced Python skills
● Beginning experience using the Spark DataFrames API
● Beginning knowledge of general data engineering concepts
● Beginning knowledge of the core features and use cases of Delta Lake
Learning objectives:
● Identify the core components of Delta Lake that make a Lakehouse possible.
● Define commonly used optimizations available in Delta Engine.
● Build end-to-end batch and streaming OLAP data pipelines using Delta Lake.
● Make data available for consumption by downstream stakeholders using
specified design patterns.
● Document data at the table level to promote data discovery and cross-team
communication.
● Apply Databricks’ recommended best practices in engineering a single source
of truth Delta architecture.
Machine Learning in Production: MLflow
and Model Deployment
Type: Instructor-led course; $1000 USD
Duration: 1 day (6 hours)
Course description: This course is separated into two main components. The first
uses MLflow as the backbone for machine learning development and production.
This includes tracking the machine learning lifecycle, packaging projects for
deployment, using the MLflow model registry, and more. The second component
looks at various production issues, the four main deployment paradigms, monitoring,
and alerting. Depending on the desires of the class, numerous electives are also
available on the various MLflow functionality and deployment scenarios.
By the end of this course, you will have built the infrastructure to manage the
development, deployment, and monitoring of models in production. This course is
taught entirely in Python.
Prerequisites:
● Intermediate experience using Python/pandas
● Familiarity with Apache Spark
● Working knowledge of machine learning and data science (scikit-learn,
TensorFlow, etc.)
● Basic familiarity with object storage, databases, and networking
Learning objectives:
● Track machine learning experiments to organize the machine learning life
cycle
● Create, organize, and package machine learning projects with a focus on
reproducibility and collaborating with a team
● Develop a generalizable way of handling machine learning models created in
and deployed to a variety of environments
● Explore the various production issues encountered in deploying and
monitoring machine learning models
● Introduce various strategies for deploying models using batch, streaming, and
real-time
● Explore solutions to drift and implement a basic retraining method and two
ways of dynamically monitoring drift
Course agenda:
● What is ML deployment?
● The Four Deployment Paradigms
● Deployment Requirements
● Deployment Architectures
● Other Issues
Migrating SAS Procedures to Databricks
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course will enable experienced SAS developers to quickly
learn how to translate familiar SAS statements and functions into code that can be
run on Databricks. It begins with an introduction to the Databricks environment and
the different approaches to coding in Databricks, followed by an overview of how
SAS PROC and DATA steps can be performed in Databricks. You will learn how
you can use Spark SQL, PySpark, and other tools to read .sas7bdat files and perform
common operations. Finally, you will see code examples and gain hands-on practice
performing some of the most common SAS operations in Databricks.
Prerequisites:
● Intermediate to advanced SAS programming experience
● Beginning knowledge of Python programming
● Beginning-level experience with SQL
Learning objectives:
● Read data stored in .sas7bdat files using Spark SQL and PySpark.
● Explain the conceptual and syntactical relationships between SAS DATA and
PROC statements and their equivalents on Databricks.
● Learn how Python can be leveraged to augment ANSI SQL to create reusable
Spark SQL code.
● Translate common PROC functions to Databricks.
● Translate common DATA steps to Databricks.
Natural Language Processing at Scale with
Databricks
Type: Self-paced; Cost: Free for customers
Duration: 5 hours
Course description: This five-hour course will teach you how to do natural language
processing at scale on Databricks. You will apply libraries such as NLTK and Gensim
in a distributed setting as well as SparkML/MLlib to solve classification, sentiment
analysis, and text wrangling tasks. You will learn how to remove stop words, when to
lemmatize vs stem your tokens, and how to generate
term-frequency-inverse-document-frequency (TFIDF) vectors for your dataset. You
will also use dimensionality reduction techniques to visualize word embeddings with
Tensorboard and apply and visualize basic vector arithmetic to embeddings.
Prerequisites:
● Experience working with PySpark DataFrames
● Mastery of concepts presented in the Databricks Academy "Apache Spark
Programming" course
● Mastery of concepts presented in the Databricks Academy "Scalable Machine
Learning with Apache Spark" course
Learning objectives:
● Explain the motivation behind using Natural Language Processing to analyze
data.
● Identify distributed Natural Language Processing libraries commonly used
when analyzing data.
● Perform a series of Natural Language Processing workflows in the Databricks
Data Science Workspace
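As one small illustration of the distributed text wrangling involved, here is a minimal Spark ML TF-IDF sketch; the documents and column names are made up, and it assumes a Spark session named spark (as in a Databricks notebook).

# A minimal sketch: TF-IDF vectors with Spark ML.
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

docs = spark.createDataFrame(
    [(0, "spark makes big data simple"), (1, "databricks runs spark in the cloud")],
    ["doc_id", "text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)                 # split text into tokens
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1024).transform(tokens)   # term frequencies
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)                    # down-weight common terms
tfidf.select("doc_id", "tfidf").show(truncate=False)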
Optimizing Apache Spark on Databricks
Type: Self-Paced (free for customers); Instructor-led ($1000 USD)
Duration: 1 day (6 hours)
Course description: In this course, students will explore five key problems that
represent the vast majority of performance problems in an Apache Spark
application: Skew, Spill, Shuffle, Storage, and Serialization. With each of these topics,
we explore examples that demonstrate how these problems are introduced, how to
diagnose these problems with tools like the Spark UI, and conclude by discussing
mitigation strategies for each of these problems.
The course will also address concepts including:
● Key ingestion concepts
● New features like Adaptive Query Execution and Dynamic Partition Pruning
● Configuring clusters for optimal performance
Prerequisites:
● Intermediate to advanced programming experience in Python or Scala
● Hands-on experience developing Apache Spark applications
Learning objectives:
● Articulate how the five most common performance problems in a Spark
application can be mitigated to achieve better application performance.
● Summarize some of the most common performance problems associated
with data ingestion and how to mitigate them.
● Articulate how new features in Spark 3.0 can be employed to mitigate
performance problems in your Spark applications.
● Configure a Spark cluster for maximum performance given specific job
requirements while considering a multitude of other factors.
Course agenda:
● Day 1 AM
○ The Five Most Common Performance Problems
■ Introduction / Benchmarking
■ Skew
■ Spill
■ Shuffle
■ Storage
■ Serialization
● Day 1 PM
○ Key Ingestion Concepts
■ Ingestion Basics
■ Predicate Push Downs
■ Disk Partitioning
■ Z-Ordering
■ Bucketing
○ Optimizing with AQE and DPP
■ Tuning Shuffle Partitions
■ Join Optimizations
■ Skew Join Optimizations
■ Dynamic Partition Pruning
○ Designing Clusters for High Performance
■ Designing Clusters for High Performance
■ Cluster Configurations Scenarios
■ Designing Clusters Breakout
● Optional review topics
○ Introduction to Spark Architecture
○ Spark UI Demo
Propagating Changes with Delta Change
Data Feed
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: A Delta change data feed represents row-level changes
between versions of a Delta table. When enabled on a Delta table, the runtime
records “change events” for all the data written into the table. This includes the row
data along with metadata indicating whether the specified row was inserted,
deleted, or updated. In this course, we'll examine some of the motivations and use
cases for this feature and see it in action.
Prerequisites:
● Basic knowledge of Spark Structured Streaming APIs
● Basic knowledge of Delta Lake
Learning objectives:
● Describe how Delta Change Data Feed emits change data records.
● Use appropriate syntax and settings to set up Change Data Feed.
● Propagate inserts, updates, and deletes with Change Data Feed.
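For a flavor of the syntax and settings covered, here is a minimal sketch on Databricks; the table name and starting version are illustrative.

# A minimal sketch: enable and read a Delta change data feed.
# 1. Enable the change data feed on an existing Delta table.
spark.sql("ALTER TABLE silver_users SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# 2. Read the row-level change events recorded after the property was set.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 1)        # or startingTimestamp
           .table("silver_users"))
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()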
Quick Reference: CI/CD
Type: Self-paced, Cost: Free for customers
Duration: 30 minutes
Course description: This quick-reference provides an overview of fundamental
concepts behind CI/CD. While the Databricks tools and integrations mentioned in
this course can be used by DevOps teams for CI/CD, this course was designed to
summarize what happens during each stage of a CI/CD pipeline (not provide a
technical how-to into each of these stages). Future courses will dive into each of
these stages in greater detail. Note: We will use Jenkins as an example automation
system in this course.
Prerequisites:
● Beginning-level experience with CI/CD, DevOps, and/or the software
development lifecycle (not necessarily on Databricks)
Learning objectives:
● Summarize each stage in a traditional CI/CD pipeline.
● Outline the steps in configuring the Jenkins automation agent for use in
CI/CD.
Quick Reference: Databricks Workspace
User Interface
Type: Self-paced; Cost: Free for customers
Duration: 10 minutes
Course description: This is a short, ten-minute introduction to the Databricks
Collaborative Data Science Workspace (Workspace). If you are new to Databricks,
we recommend taking this course to familiarize yourself with the layout of the
Databricks Workspace.
Prerequisites:
● Beginning-level knowledge of big data and data science concepts.
● Beginning-level knowledge of the functionality within the Unified Data
Analytics Platform.
Learning objectives:
● Define vocabulary terms relevant to data practitioners working in the
Workspace.
● Navigate to major components of the Workspace.
Quick Reference: Managing Databricks
Notebooks with the Databricks
Workspace
Type: Self-paced; Cost: Free for customers
Duration: 5 minutes
Course description: This course includes a series of short videos that show how to
perform tasks to manage Databricks notebooks including creating, opening, deleting,
and distributing notebooks. It also includes information on attaching and detaching
notebooks to clusters and controlling access to notebooks. This course does not
cover how to analyze data using notebooks.
Prerequisites:
● Beginning experience with the Databricks Unified Data Analytics Platform
helpful but not required.
● Beginning knowledge about data science and big data concepts helpful but
not required.
Learning objectives:
● Execute basic Databricks notebooks management tasks in the Collaborative
Data Science Workspace.
Quick Reference: Relational Entities on
Databricks
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: While the syntax for creating and working with databases,
tables, and views will be familiar to most SQL users, some default behaviors may
surprise users new to Databricks. In this short course, you’ll learn how to create
databases, tables, and views on Databricks. Special attention will be given to
differences in scope and persistence for these various entities, allowing any user
that will be responsible for creating or managing databases, tables, or views to make
informed decisions for their organization.
Prerequisites:
● Beginning knowledge of SQL
● Beginning knowledge of loading and interacting with data files in Spark
Learning objectives:
● Write basic queries that create databases, tables, and views.
● Describe how relational entities are managed by the catalog on Databricks.
● Describe how the LOCATION keyword changes the default behavior for
database contents.
● Describe the differences in syntax and performance for managed and
unmanaged tables.
● Describe the differences in scope and persistence between views, temp
views, and global temp views.
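To preview the distinctions the course draws, here is a minimal sketch run from a Databricks notebook; all database, table, and path names are hypothetical.

# A minimal sketch: managed vs. unmanaged tables, and view scopes.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db LOCATION '/tmp/demo_db'")   # LOCATION overrides the default storage path
spark.sql("USE demo_db")

spark.sql("CREATE TABLE managed_sales (id INT, amount DOUBLE)")              # managed: data lives with the database
spark.sql("""CREATE TABLE unmanaged_sales (id INT, amount DOUBLE)
             USING DELTA LOCATION '/tmp/external/sales'""")                  # unmanaged: data stays at the external path

spark.sql("CREATE OR REPLACE TEMP VIEW recent_sales AS SELECT * FROM managed_sales WHERE id > 100")   # session-scoped
spark.sql("CREATE OR REPLACE GLOBAL TEMP VIEW all_sales AS SELECT * FROM managed_sales")              # cluster-scoped, under global_temp
spark.sql("SELECT * FROM global_temp.all_sales").show()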
Quick Reference: Spark Architecture
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: Apache Spark™ is a unified analytics engine for large scale data
processing known for its speed, ease and breadth of use, ability to access diverse
data sources, and APIs built to support a wide range of use-cases. Databricks builds
on top of Spark and adds many performance and security enhancements. This
course is meant to provide an overview of Spark’s internal architecture.
Prerequisites:
● Beginning knowledge of big data and data science concepts.
Learning objectives:
● Describe basic Spark architecture and define terminology such as “driver”
and “executor”
● Explain how parallelization allows Spark to improve speed and scalability of an
application
● Describe lazy evaluation and how it relates to pipelining
● Identify high-level events for each stage in the Optimization process
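As a one-minute illustration of lazy evaluation and pipelining, consider this sketch (spark is the usual SparkSession provided in a Databricks notebook).

# A minimal sketch: transformations only build a plan; the action triggers the work.
df = spark.range(10_000_000)                                       # no job runs yet
evens = df.filter("id % 2 = 0").selectExpr("id * 2 AS doubled")    # still lazy: transformations are pipelined
print(evens.count())                                               # the action launches jobs, stages, and tasks on the executors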
Scalable Deep Learning with TensorFlow
and Apache Spark™
Type: Instructor-led and self-paced; Cost: Instructor-led $1500 USD; self-paced -
free for customers
Duration: 2 days (12 hours)
Course description: This course starts with the basics of the tf.keras API including
defining model architectures, optimizers, and saving/loading models. You then
implement more advanced concepts such as callbacks, regularization, TensorBoard,
and activation functions. After training your models, you will integrate the MLflow
tracking API to reproduce and version your experiments. You will also apply model
interpretability libraries such as LIME and SHAP to understand how the network
generates predictions. You will also learn about various Convolutional Neural
Networks (CNNs) architectures and use them as a basis for transfer learning to
reduce model training time.
Substantial class time is spent on scaling your deep learning applications, from
distributed inference with pandas UDFs to distributed hyperparameter search with
Hyperopt to distributed model training with Horovod. This course is taught fully in
Python.
Prerequisites:
● Intermediate experience with Python/pandas
● Familiarity with machine learning concepts
● Experience with PySpark
Learning objectives:
● Build deep learning models using Keras/TensorFlow
● Scale the following:
○ Model inference with pandas UDFs & pandas function API
○ Hyperparameter tuning with HyperOpt
○ Training of distributed TensorFlow models with Horovod

● Track, version, and reproduce experiments using MLflow
● Apply model interpretability libraries to understand & visualize model
predictions
● Use CNNs (convolutional neural networks) and perform transfer learning &
data augmentation to improve model performance
● Deploy deep learning models
Course agenda (instructor-led):
● Day 1 AM
○ Spark Review
○ Linear Regression
○ Keras
○ Keras lab
● Day 1 PM
○ Advanced Keras
○ Advanced Keras lab
○ MLflow
○ MLflow lab
○ Hyperopt
○ Hyperopt lab
○ Horovod
● Day 2 AM
○ Horovod Petastorm
○ Horovod lab
○ Model interpretability
○ CNNs
● Day 2 PM
○ CNNs
○ SHAP for CNNs
○ Transfer learning
○ Data augmentation
○ Transfer learning lab
○ Model serving
○ Generative adversarial networks
○ Best practices
Scalable Machine Learning with Apache
Spark
Type: Instructor-led and self-paced; Cost: Instructor-led $1500 USD; self-paced -
free for customers
Duration: 2 days (12 hours)
Course description: This course guides students through the process of building
machine learning solutions using Spark. You will build and tune ML models with
SparkML using transformers, estimators, and pipelines. This course highlights some
of the key differences between SparkML and single-node libraries such as
scikit-learn. Furthermore, you will reproduce your experiments and version your
models using MLflow. You will also integrate third-party libraries into Spark workloads,
such as XGBoost. In addition, you will leverage Spark to scale inference of
single-node models and parallelize hyperparameter tuning.
Prerequisites:
● Intermediate experience with Python
● Beginning experience with the PySpark DataFrame API (or having taken the
Apache Spark Programming with Databricks class)
● Working knowledge of machine learning and data science
Learning objectives:
● Create data processing pipelines with Spark.
● Build and tune machine learning models with Spark ML.
● Track, version, and deploy models with MLflow.
● Perform distributed hyperparameter tuning with Hyperopt.
● Use Spark to scale the inference of single-node models.
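As a preview of the transformer/estimator pattern the course builds on, here is a minimal Spark ML pipeline sketch with made-up columns and data.

# A minimal sketch: assemble features and fit a linear regression in a Pipeline.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

data = spark.createDataFrame(
    [(1.0, 2.0, 10.0), (2.0, 1.0, 12.0), (3.0, 5.0, 21.0), (4.0, 3.0, 24.0)],
    ["x1", "x2", "label"])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),   # transformer
    LinearRegression(featuresCol="features", labelCol="label")       # estimator
])
model = pipeline.fit(data)
model.transform(data).select("features", "label", "prediction").show()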
Course agenda (instructor-led):
● Day 1 AM
○ Apache Spark Overview
○ ML Overview
○ Data Cleansing
○ Data Exploration
○ Linear Regression I
● Day 1 PM
○ Linear Regression I Continued
○ Linear Regression II
○ MLflow Tracking
○ MLflow Model Registry
○ MLflow Lab
● Day 2 AM
○ Decision Trees
○ Hyperparameter Tuning
○ Hyperopt
● Day 2 PM
○ Hyperopt Continued
○ MLlib Deployment Options
○ XGBoost
○ Inference with Pandas UDFs
○ Training with Pandas UDFs
○ Koalas
○ Capstone Project
Service Overview: Databricks Data
Science & Engineering Workspace
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: The Databricks Data Science and Engineering Workspace
(Workspace) provides a collaborative analytics platform to help data practitioners
get the most out of Databricks when it comes to data science and engineering
tasks. This course (formerly known as Databricks Data Science & Engineering
Workspace) guides practitioners through fundamental Workspace concepts and
components necessary to achieve a basic development workflow.
Prerequisites & Requirements
● Prerequisites
○ None.
Learning objectives
● Add a user to your Databricks environment and provide access to Databricks
SQL.
● Create and start a SQL endpoint to provide a computation resource for the
user.
● Configure access to the default database for the user to run SQL commands
on using the SQL endpoint.
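For context only, a minimal sketch of what granting a user access to the default database might look like; the principal is a placeholder, the statements assume legacy table access control, and the exact steps used in the course may differ.

    # Placeholder principal; run from a notebook or as equivalent statements
    # in the Databricks SQL editor.
    user = "new.analyst@example.com"
    spark.sql(f"GRANT USAGE ON DATABASE default TO `{user}`")
    spark.sql(f"GRANT SELECT ON DATABASE default TO `{user}`")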
SQL Coding Challenges
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Course description: Taking this course will familiarize you with the content and
format of the Associate SQL Analyst Accreditation, as well as provide you with some
practical exercises that you can use to improve your skills or cement newly learned
concepts. We recommend that you complete Fundamentals of SQL on Databricks
and Applications of SQL on Databricks before using this guide.
Prerequisites:
● Intermediate-level ability with SQL
Learning objectives:
● Describe the format and scope of the SQL analyst accreditation
● Identify the scope of knowledge-based and practical topics covered
● Complete practical exercises to practice applying SQL skills on Databricks
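To give a sense of the practical exercises (this example is illustrative and not taken from the accreditation), a typical task resembles a grouped aggregation run from a Databricks notebook; the sales table and its columns are hypothetical.

    # Hypothetical table and columns, for illustration only.
    top_regions = spark.sql("""
        SELECT region,
               ROUND(SUM(amount), 2) AS total_sales
        FROM sales
        GROUP BY region
        ORDER BY total_sales DESC
        LIMIT 5
    """)
    top_regions.show()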
Structured Streaming
Type: Self-paced; Cost: Free for customers
Duration:
Course description: This hands-on self-paced training course targets data
engineers who want to process big data using Apache Spark™ Structured Streaming.
The course is a series of four self-paced lessons. Each lesson includes hands-on
exercises. The course contains Databricks notebooks for both Azure Databricks and
AWS Databricks; you can run the course on either platform.
Prerequisites:
● Completion of the Apache Spark Programming with Databricks course is
strongly encouraged
Learning objectives:
● Use the interactive Databricks notebook environment
● Ingest streaming log file data
● Aggregate small batches of data with time windows
● Stream data from a Kafka connection
● Use Structured Streaming in conjunction with Databricks Delta
● Visualize streaming live data
● Use Structured Streaming to analyze streaming Twitter data
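As an illustration of the windowed aggregations covered in the course (a minimal sketch, not a course notebook), the input path and schema below are assumptions.

    from pyspark.sql.functions import window, col

    # Illustrative schema and input path only.
    events = (spark.readStream
              .schema("event_time TIMESTAMP, level STRING, message STRING")
              .json("/tmp/demo/streaming_logs/"))

    # Count events per severity level in 10-minute windows.
    counts = (events
              .groupBy(window(col("event_time"), "10 minutes"), col("level"))
              .count())

    query = (counts.writeStream
             .outputMode("complete")
             .format("memory")
             .queryName("log_counts")
             .start())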
Tracking Experiments with MLflow
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: In this course, we’ll show you how to design an MLflow
experiment to identify the best machine learning model for deployment. This course is the
second in a series of three courses developed to show you how to use Databricks to
work with a single data set from experimentation to production-scale machine
learning model deployment. The other courses in this series include:
● Data Science on Databricks: The Bias-Variance Tradeoff
● Deploying a Machine Learning Project with MLflow Projects
Prerequisites:
● Beginning-level experience running data science workflows in the Databricks
Workspace
● Beginner-level experience with Apache Spark
● Intermediate-level experience with the SciPy numerical stack
Learning objectives:
● Create and explore an augmented sample from user event and profile data.
● Design an MLflow experiment and write notebook-based software to run the
experiment to assess various linear models.
● Examine experimental results to decide which model to develop for
production.
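For orientation (a minimal sketch under assumed data, not course material): the log-and-compare pattern the course builds on looks roughly like this; the Ridge model and the X_train/X_test/y_train/y_test variables are assumptions.

    import mlflow
    import mlflow.sklearn
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    # X_train, X_test, y_train, y_test are assumed to exist already.
    with mlflow.start_run(run_name="ridge-alpha-0.5"):
        model = Ridge(alpha=0.5)
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        mlflow.log_param("alpha", 0.5)       # record the hyperparameter
        mlflow.log_metric("mse", mse)        # record the evaluation metric
        mlflow.sklearn.log_model(model, "model")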
What’s New in Apache Spark 3.0
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Course description: This course was created to teach Databricks users about the
major improvements to Spark in the 3.0 release. It will give an overview of new
features meant to improve performance and usability. Students will also learn about
backwards compatibility with 2.x and some of the considerations required for
updating to Spark 3.0.
Prerequisites:
● Familiarity with Spark 2.x
Learning objectives:
● Describe major improvements to performance in Spark 3.0
● Identify major usability improvements in Spark 3.0
● Recognize relevant compatibility considerations for migrating to Spark 3.0
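One of the Spark 3.0 performance features covered is Adaptive Query Execution; as a small illustration (configuration settings only, not course content), it can be enabled for a session as follows.

    # Enable Adaptive Query Execution and automatic partition coalescing (Spark 3.0+).
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")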
Credential descriptions
Azure Databricks Certified Associate
Platform Administrator
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: The Azure Databricks Certified Associate Platform Administrator
certification exam assesses an understanding of network infrastructure and security
with Databricks, including workspace deployment, Azure cloud concepts, and
network security. The exam also assesses the understanding of identity and access
on Azure Databricks, including identity management, workspace access control,
data access control, and fine-grained security. In addition, the exam assesses
cluster configuration and usage management. Lastly, developer tools and
automation processes are assessed.
Prerequisites:
● The minimally qualified candidate should:
○ have an intermediate understanding of network infrastructure and
security, including: workspace deployment, Azure cloud concepts,
network security
○ have a complete understanding of identity and access configurations,
including: identity management, workspace access control, data
access control, fine-grained security using SQL
○ have an intermediate understanding of cluster usage, including: cluster
configuration and usage management
○ have a basic understanding of automation, including: developer tools,
automation processes
Databricks Certified Associate Developer
for Apache Spark 3.0
Type: Certification; Cost: $200 USD
Duration: 2 hours
Description: The Databricks Certified Associate Developer for Apache Spark 3.0
certification exam assesses the understanding of the Spark DataFrame API and the
ability to apply the Spark DataFrame API to complete basic data manipulation tasks
within a Spark session. These tasks include selecting, renaming and manipulating
columns; filtering, dropping, sorting, and aggregating rows; handling missing data;
combining, reading, writing and partitioning DataFrames with schemas; and working
with UDFs and Spark SQL functions. In addition, the exam will assess the basics of
the Spark architecture like execution/deployment modes, the execution hierarchy,
fault tolerance, garbage collection, and broadcasting.
Prerequisites:
● The minimally qualified candidate should:
○ have a basic understanding of the Spark architecture, including
Adaptive Query Execution
○ be able to apply the Spark DataFrame API to complete individual data
manipulation tasks, including:
○ selecting, renaming and manipulating columns
○ filtering, dropping, sorting, and aggregating rows
○ joining, reading, writing and partitioning DataFrames
○ working with UDFs and Spark SQL functions
Developers who have been using the Spark DataFrame API for six months or
more are expected to be able to pass this certification exam.
While it will not be explicitly tested, the candidate must have a working
knowledge of either Python or Scala. The exam is available in both languages.
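For a sense of the exam's practical scope (an illustrative sketch, not an exam item), the following PySpark snippet touches several of the listed DataFrame tasks; df and its columns are assumptions.

    from pyspark.sql import functions as F

    # df is an assumed DataFrame with "category", "price", and "quantity" columns.
    result = (df
              .withColumnRenamed("price", "unit_price")                        # renaming columns
              .withColumn("revenue", F.col("unit_price") * F.col("quantity"))  # manipulating columns
              .na.drop(subset=["category"])                                    # handling missing data
              .filter(F.col("revenue") > 0)                                    # filtering rows
              .groupBy("category")
              .agg(F.sum("revenue").alias("total_revenue"))                    # aggregating rows
              .orderBy(F.desc("total_revenue")))                               # sorting rows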
Databricks Certified Professional Data
Engineer
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: The Databricks Certified Professional Data Engineer certification
exam assesses an individual’s ability to use Databricks to perform common data
engineering tasks. This includes an understanding of the Databricks platform and
developer tools like Apache Spark, Delta Lake, MLflow, and the Databricks CLI and
REST API. It also assesses the ability to build optimized and cleaned ETL pipelines.
Additionally, modeling data into a Lakehouse using knowledge of general data
modeling concepts will also be assessed. Finally, ensuring that data pipelines are
secure, reliable, monitored, and tested before deployment will also be included in
this exam. Individuals who pass this certification exam can be expected to complete
data engineering tasks using Databricks and its associated tools.
Prerequisites:
● The minimally qualified candidate should have:
○ an understanding of the Databricks platform and developer tools,
including: Apache Spark, Delta Lake, MLflow, the Databricks CLI, and
the Databricks REST API
○ the ability to build optimized and cleaned ETL pipelines
○ an understanding of how to model data into a Lakehouse using general
data modeling concepts
○ the ability to ensure that data pipelines are secure, reliable,
monitored, and tested before deployment
Databricks Certified Professional Data
Scientist
Type: Self-paced; Cost: Free for customers
Duration: 2 hours
Course description: The Databricks Certified Professional Data Scientist certification
exam assesses the understanding of the basics of machine learning and the steps in
the machine learning lifecycle, including data preparation, feature engineering, the
training of models, model selection, interpreting models, and the production of
models. The exam also assesses the understanding of basic machine learning
algorithms and techniques, including linear regression, logistic regression,
regularization, decision trees, tree-based ensembles, basic clustering algorithms,
and matrix factorization techniques. The basics of model management with MLflow,
like logging and model organization, are also assessed.
Prerequisites:
● The minimally qualified candidate should have:
○ a complete understanding of the basics of machine learning, including:
the bias-variance tradeoff, in-sample vs. out-of-sample data, categories
of machine learning, applied statistics concepts
○ an intermediate understanding of the steps in the machine learning
lifecycle, including: data preparation, feature engineering, model
training, model selection, model production, and interpreting models
○ a complete understanding of basic machine learning algorithms and
techniques, including: linear, logistic, and regularized regression,
tree-based models like decision trees, random forest and gradient
boosted trees, unsupervised techniques like K-means and PCA,
specific algorithms like ALS for recommendation and isolation forests
for outlier detection
○ a complete understanding of the basics of machine learning model
management like logging and model organization with MLflow
Fundamentals of the Databricks
Lakehouse Platform Accreditation
Type: Self-paced; Cost: Free for customers
Duration: 30 minutes
Accreditation description: This is a 30-minute assessment that will test your
knowledge about fundamental concepts related to the Databricks Lakehouse
Platform. Questions will assess how well you know the platform in general, how
familiar you are with the individual components of the platform, and your ability to
describe how the platform helps organizations accomplish their data engineering,
data science/machine learning, and business/SQL analytics use cases. Please note
that this assessment will not test your ability to perform tasks using Databricks
functionality. Instead, it will test how well you can explain components of the
platform and how they fit together.
After successfully completing this assessment, you will be awarded a Databricks
Lakehouse Platform badge.
This accreditation is the beginning step in most of the Databricks Academy learning
plans - SQL Analysts, Data Scientists, Data Engineers, and Platform Administrators.
Business leaders are also welcome to take this assessment.
Prerequisites:
● We recommend that you take the following courses to prepare for this
accreditation exam:
○ What is the Databricks Lakehouse Platform?
○ What are Enterprise Data Management Systems? (particularly the
section on Lakehouse architecture)
○ What is Delta Lake?
○ What is Databricks SQL?
○ What is Databricks Machine Learning?
SQL Analyst Accreditation
Type: Self-paced; Cost: Free for customers
Duration: 1 hour
Accreditation description: In this 1-hour accreditation exam, you will demonstrate
your ability to use Apache Spark SQL to query, transform, and present data.
Prerequisites:
● Intermediate experience with SQL.