Watchtower - Databricks Logging Solution

A solution accelerator for standardizing, centralizing, and ingesting Databricks cluster logs for enhanced log searching, troubleshooting, and optimization.

The goal of this repo is to give platform administrators and data engineers a reference implementation for scalable log ingestion, as well as broader guidance for practitioners on using logging in their jobs.

The included dashboard automatically detects common Spark issues without requiring manual log analysis. It provides pre-built pattern detectors for performance problems (OutOfMemory errors, spill to disk, lost executors) and privacy concerns (df.show() statements in logs), alongside flexible log search and Top N analysis capabilities. This lets you quickly identify and troubleshoot job failures by surfacing the most relevant issues at a glance.

Log Search dashboard

Top N analysis dashboard

Prerequisites

Before deploying this solution, ensure you have:

  • Unity Catalog: A catalog named watchtower must be created in your workspace, or you can provide your own catalog name by modifying the catalog variable in databricks.yml (see the sketch after this list)
  • SQL Warehouse: A SQL warehouse must be available and referenced in databricks.yml (default: "Shared Unity Catalog Serverless")
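
For reference, a catalog override in databricks.yml might look roughly like the sketch below. Only the catalog variable is named above; the warehouse_name key is hypothetical and included only to illustrate the same pattern for the SQL warehouse reference.

variables:
  catalog:
    description: Unity Catalog catalog that receives the ingested logs
    default: watchtower                          # replace with your own catalog name
  warehouse_name:                                # hypothetical variable name, for illustration only
    description: SQL warehouse that backs the dashboard
    default: Shared Unity Catalog Serverless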

Getting started

  1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/install.html

  2. Authenticate to your Databricks workspace (if you have not done so already):

    $ databricks configure
    
  3. To deploy a development copy of this project, type:

    $ databricks bundle deploy --target dev
    

    (Note that "dev" is the default target, so the --target parameter is optional here.)

    This deploys everything that's defined for this project. For example, the default template would deploy a Lakeflow Pipeline called [dev yourname] watchtower_pipeline to your workspace. You can find that pipeline by opening your workspace and clicking on Jobs & Pipelines.

  4. Similarly, to deploy a production copy, configure a prod target in databricks.yml (a sketch of one appears after this list) and type:

    $ databricks bundle deploy --target prod
    
  5. To deploy and run integration tests:

    $ databricks bundle deploy --target staging
    $ databricks bundle run integration_tests --target staging
    
  6. To run a deployed job or pipeline, use the "run" command, where [KEY] is the resource key defined in the bundle:

    $ databricks bundle run [KEY]
    
  7. Optionally, install developer tools such as the Databricks extension for Visual Studio Code from https://docs.databricks.com/dev-tools/vscode-ext.html.

  8. For documentation on the Databricks Asset Bundles format used for this project, and for CI/CD configuration, see https://docs.databricks.com/dev-tools/bundles/index.html.
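
Referring back to step 4, a prod target in databricks.yml might look roughly like the sketch below. The host URL and root path are placeholders rather than values taken from this project:

targets:
  prod:
    mode: production                                  # enables production deployment behavior
    workspace:
      host: https://<your-workspace-url>              # placeholder; use your workspace URL
      root_path: /Workspace/Shared/.bundle/watchtower/${bundle.target}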

Teardown and Cleanup

To completely tear down all resources that were deployed as part of this asset bundle:

databricks bundle destroy --target dev

If you also used the provided Terraform to deploy your Unity Catalog Volume resources, you can tear those down as well:

cd terraform && terraform destroy && cd ..

Linting and Code Analysis

To run the linter by itself:

hatch run lint:check

To run the linter with automatic fixing (when possible):

hatch run lint:fix

Run all static analysis tools with one command:

hatch run analyze:all

Run Tests

To run unit tests:

hatch run test:test

CI/CD Setup (Optional)

To enable automatic testing on Pull Requests:

  1. Add GitHub Repository Secrets:

    • Go to your repo → Settings → Secrets and variables → Actions
    • Add secret: DATABRICKS_TOKEN (a Databricks personal access token; see the workflow sketch after this list)
    • Optionally add variable: DATABRICKS_HOST
  2. What happens automatically:

    • Pull Requests: Validated and tested with isolated workspace paths
    • Main branch: Deployed to your dev environment
    • PR cleanup: Resources automatically cleaned up when PR is closed
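
As an illustration only (this is not necessarily the repository's actual workflow file), a minimal GitHub Actions job that consumes the secret and variable above might look like this:

# .github/workflows/validate.yml -- illustrative sketch
name: validate-bundle
on:
  pull_request:

jobs:
  validate:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ vars.DATABRICKS_HOST }}        # optional repository variable
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}   # repository secret
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main                   # installs the Databricks CLI
      - run: databricks bundle validate --target dev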

What Gets Deployed

  • Pipeline: Databricks log ingestion pipeline
  • Workflow: watchtower_demo_job
  • Dashboard: Logs Dashboard (deployed alongside pipeline) - includes log search with filtering, Top N analysis for errors and jobs, and pre-built detectors for common issues like OOM errors, spill to disk, lost executors, and privacy concerns
  • Location: /Workspace/Users/your-email@company.com/.bundle/watchtower/

Manual Commands (if you prefer)

cd terraform && terraform init && terraform apply && cd .. # Deploy Catalog resources and init scripts
databricks bundle validate    # Check configuration
databricks bundle deploy      # Deploy to workspace
databricks bundle run demo_workflow # Run the demo workflow
databricks bundle summary     # See what's deployed
databricks bundle destroy     # Remove everything
cd terraform && terraform destroy && cd .. # Remove Terraform-managed Catalog resources

Raw Log File Retention

Watchtower automatically manages the cleanup of processed raw log files using Databricks Auto Loader's cleanSource feature. This helps control storage costs and meet data retention compliance requirements.

Configuration Options

You can customize the retention behavior using these variables in databricks.yml (one possible wiring into the pipeline is sketched after this list):

  • raw_log_retention_action (default: DELETE)

    • Valid values: DELETE, MOVE, OFF
    • DELETE: Permanently removes processed files after the retention period
    • MOVE: Archives processed files to a cold storage location
    • OFF: Disables automatic cleanup (files are never deleted)
  • raw_log_retention_duration (default: 30 days)

    • How long to retain raw log files before cleanup
    • Format: <number> <unit> (e.g., 7 days, 90 days, 1 month)
  • raw_log_retention_move_destination (default: "")

    • Destination path when using MOVE action
    • Leave empty when using DELETE action
    • Example: /Volumes/watchtower/default/archived_logs/
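
How these variables reach the ingestion pipeline is an implementation detail of this bundle; one plausible wiring, shown purely as an assumption rather than this repo's verified layout, is to pass them through the pipeline's configuration block:

resources:
  pipelines:
    watchtower_pipeline:
      configuration:
        # assumed pass-through of the bundle variables above into the pipeline
        raw_log_retention_action: ${var.raw_log_retention_action}
        raw_log_retention_duration: ${var.raw_log_retention_duration}
        raw_log_retention_move_destination: ${var.raw_log_retention_move_destination}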

Example Configurations

Keep logs for 7 days, then delete:

variables:
  raw_log_retention_action: DELETE
  raw_log_retention_duration: 7 days

Archive logs to cold storage after 90 days:

variables:
  raw_log_retention_action: MOVE
  raw_log_retention_duration: 90 days
  raw_log_retention_move_destination: /Volumes/watchtower/default/archived_logs/

Disable automatic cleanup:

variables:
  raw_log_retention_action: OFF

Note: This feature was highlighted in the Databricks blog on scalable logging under Operational Considerations. While you can also implement retention using cloud provider lifecycle rules (S3/ADLS/GCS), this built-in approach is simpler and doesn't require additional cloud configuration.

Customizing for Your Project

  1. Update databricks.yml or the resources/*.yml files with your job/pipeline settings (a resource sketch appears after this list)
  2. Modify src/watchtower/log_ingest_pipeline.py to customize the log ingestion pipeline
  3. Modify the workspace host and root_path as needed
  4. Modify terraform/main.tf as needed, or create a terraform/.auto.tfvars file to override Terraform variables.
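
As a rough sketch of item 1, a pipeline resource definition might look like the following. The library path comes from src/watchtower/log_ingest_pipeline.py mentioned above; the file name and remaining settings are assumptions, not this repo's exact layout:

# resources/watchtower_pipeline.yml -- illustrative sketch
resources:
  pipelines:
    watchtower_pipeline:
      name: watchtower_pipeline
      catalog: ${var.catalog}                    # catalog variable from databricks.yml
      libraries:
        - file:
            path: ../src/watchtower/log_ingest_pipeline.py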
