Running SQL like queries on Xarray Datasets. Consider dataset as a table and data variable as a column.
Note: For now, we support only zarr datasets and earth engine image collections.
SelectVariables - From a large dataset having hundreds of variables select only needed variables.- Apply
whereclause - A general where condition like SQL. Applicable for queries which includes data for specific time range or only for specific regions. group byFunctions - This is supported on the coordinates only. e.g. time, latitude, longitude, etc.aggregateFunctions - Aggregate functionsAVG(),MIN(),MAX(), etc. Only supported on data variables.limit and offsetclause - Apply limit and offset to filter out the required result.- For more checkout the road-map.
Note: For now, we support
whereconditions andgroupbyon coordinates only.orderbycan only be applied either on selected variables or on coordinates.
Get an access to the dataset you want to query. Here as an example we're going to use the analysis ready era5 public dataset. full_37-1h-0p25deg-chunk-1.zarr-v3.
For this gcloud must be configured in your local environment. Refer Initializing the gcloud CLI for configuring the gcloud locally.
# Install required packages
pip install xql
# Jump into xql
python xql/main.py xql
.help: For usage info.
.exit: To exit from the xql interpreter.
.set: To set the dataset uri as a shortened key.
.set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3
.show: To list down dataset shortened key. Eg. .show or .show [key]
.show era5
[query] => Any valid sql like query.
-
Apply a conditions. Query to get temperature of arctic region in January 2022:
SELECT temperature FROM 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3' WHERE time >= '2022-01-01' AND time < '2022-02-01' AND latitude >= 66.5Note: Multiline queries are not yet supported. Convert copied queries into single line before execution.
-
Aggregating results using Group By and Aggregate function. Daily average of temperature of arctic region in January 2022. Setting the table name as shortened key.
.set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3SELECT AVG(temperature), SUM(charnock), MIN('100m_v_component_of_wind') FROM era5 WHERE time >= '2022-01-01' AND time < '2022-02-01' AND latitude >= 66.5 GROUP BY time_dateReplace
time_datetotime_monthortime_yearif monthly or yearly average is needed. Also useMIN()andMAX()functions same way asAVG(). -
caveat: Above queries run on the client's local machine and it generates a large two dimensional array so querying for very large amount of data will fall into out of memory erros.e.g. Query like below will give OOM errors if the client machine don't have the enough RAM.
SELECT evaporation, geopotential_at_surface, temperature FROM era5
Steps to deploy a Dask Cluster on GKE.
- Create a Kubernetes cluster if don't have any. Follow Creating a zonal cluster.
- Get Cluster Credentials on Local Machine
gcloud container clusters get-credentials {cluster_name} --region {cluster_region} --project {project} - Install
helm. Follow Helm | Installing Helm. - Deploy Dask Cluster using below
helmcommands.helm repo add dask https://helm.dask.org/ helm repo update helm install xql-dask dask/daskReplace
xql-daskfrom above command with the name you want your dask cluster to have. Just setDASK_CLUSTER={dask_cluster_name}environment variable. - Connect to a dask cluster
from xql.utils import connect_dask_cluster connect_dask_cluster()
Updated on 2024-01-08
- Select Variables
- On Coordinates
- On Variables
- Where Clause:
=,>,>=,<,<=, etc.- On Coordinates
- On Variables
- Aggregate Functions: Only
AVG(),MIN(),MAX(),SUM()are supported.- With Group By
- Without Group By
- Multiple Aggregate function in a single query
- Order By: Apply sorting on the result.
- Limit: Limiting the result to display.
- Mathematical Operators
(+, - , *, / ): Add support to use mathematical operators in the query. - Aliases: Add support to alias while querying.
- Join Operations: Support joining tables and apply query.
- Nested Queries: Add support to write nested queries.
- Custom Aggregate Functions: Support custom aggregate functions
Querying weather data using Natural Language prompts. This uses a gemini (large language model from Google) to generate SQL like queries and xql to execute that query.
Google API Key is needed to initiate Language Model. Refer Setup your API key to generate that key.
Set that key as an environment variable. Run below command.
export GOOGLE_API_KEY="generate_key"
# Install required packages
pip install xql
# Jump into language model
python xql/main.py lm
Input Prompt: Daily average temperature of New York for January 2015
Relevant SQL Query:
SELECT
AVG('temperature')
FROM
'gs://darshan-store/ar/2013-2022-full_37-1h-0p25deg-chunk-1.zarr-v3'
WHERE
time >= '2015-01-01' AND
time < '2015-02-01' AND
city = 'New York'
GROUP BY time_date
Output Data:
time_date avg_temperature
0 2015-01-01 240.978073
1 2015-01-02 243.375031
2 2015-01-03 244.584747
3 2015-01-04 249.673065
4 2015-01-05 245.650833
...
Query took: 00:01:55