xql

`xql` - Querying Xarray Datasets with SQL

Running SQL like queries on Xarray Datasets. Consider dataset as a table and data variable as a column.

Note: For now, we support only zarr datasets and earth engine image collections.

Supported Features

Select Variables - From a large dataset having hundreds of variables select only needed variables.
Apply where clause - A general where condition like SQL. Applicable for queries which includes data for specific time range or only for specific regions.
group by Functions - This is supported on the coordinates only. e.g. time, latitude, longitude, etc.
aggregate Functions - Aggregate functions AVG(), MIN(), MAX(), etc. Only supported on data variables.
limit and offset clause - Apply limit and offset to filter out the required result.
For more checkout the road-map.

Note: For now, we support where conditions and groupby on coordinates only. orderby can only be applied either on selected variables or on coordinates.

Quickstart

Prerequisites

Get an access to the dataset you want to query. Here as an example we're going to use the analysis ready era5 public dataset. full_37-1h-0p25deg-chunk-1.zarr-v3.

For this gcloud must be configured in your local environment. Refer Initializing the gcloud CLI for configuring the gcloud locally.

Usage

# Install required packages
pip install xql

# Jump into xql
python xql/main.py xql

Supported meta commands

.help: For usage info.

.exit: To exit from the xql interpreter.

.set: To set the dataset uri as a shortened key.

.set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3

.show: To list down dataset shortened key. Eg. .show or .show [key]

.show era5

[query] => Any valid sql like query.

Example Queries

Apply a conditions. Query to get temperature of arctic region in January 2022:

SELECT 
    temperature 
FROM 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3' 
WHERE
    time >= '2022-01-01' AND 
    time < '2022-02-01' AND 
    latitude >= 66.5

Note: Multiline queries are not yet supported. Convert copied queries into single line before execution.

Aggregating results using Group By and Aggregate function. Daily average of temperature of arctic region in January 2022. Setting the table name as shortened key.
```
.set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3
```
```
SELECT 
    AVG(temperature), SUM(charnock), MIN('100m_v_component_of_wind') 
FROM era5
WHERE 
    time >= '2022-01-01' AND 
    time < '2022-02-01' AND 
    latitude >= 66.5
GROUP BY time_date
```
Replace time_date to time_month or time_year if monthly or yearly average is needed. Also use MIN() and MAX() functions same way as AVG().
caveat: Above queries run on the client's local machine and it generates a large two dimensional array so querying for very large amount of data will fall into out of memory erros.

e.g. Query like below will give OOM errors if the client machine don't have the enough RAM.
```
SELECT 
    evaporation,
    geopotential_at_surface,
    temperature 
FROM era5
```

Dask Cluster Configuration

Steps to deploy a Dask Cluster on GKE.

Create a Kubernetes cluster if don't have any. Follow Creating a zonal cluster.

Get Cluster Credentials on Local Machine

gcloud container clusters get-credentials {cluster_name} --region {cluster_region} --project {project}

Install helm. Follow Helm | Installing Helm.
Deploy Dask Cluster using below helm commands.
```
helm repo add dask https://helm.dask.org/
helm repo update
helm install xql-dask dask/dask
```
Replace xql-dask from above command with the name you want your dask cluster to have. Just set DASK_CLUSTER={dask_cluster_name} environment variable.

Connect to a dask cluster

from xql.utils import connect_dask_cluster
connect_dask_cluster()

Roadmap

Updated on 2024-01-08

`weather-lm` - Querying weather data using Natural Language prompts

Querying weather data using Natural Language prompts. This uses a gemini (large language model from Google) to generate SQL like queries and xql to execute that query.

Quickstart

Prerequisites

Google API Key is needed to initiate Language Model. Refer Setup your API key to generate that key.

Set that key as an environment variable. Run below command.

export GOOGLE_API_KEY="generate_key"

Usage

# Install required packages
pip install xql

# Jump into language model
python xql/main.py lm

Examples

Input Prompt: Daily average temperature of New York for January 2015

Relevant SQL Query:

SELECT 
    AVG('temperature') 
FROM 
    'gs://darshan-store/ar/2013-2022-full_37-1h-0p25deg-chunk-1.zarr-v3' 
WHERE 
    time >= '2015-01-01' AND 
    time < '2015-02-01' AND 
    city = 'New York' 
GROUP BY time_date

Output Data:

     time_date  avg_temperature
0   2015-01-01       240.978073
1   2015-01-02       243.375031
2   2015-01-03       244.584747
3   2015-01-04       249.673065
4   2015-01-05       245.650833
...
Query took: 00:01:55

Name		Name	Last commit message	Last commit date
parent directory ..
src		src
README.md		README.md
main.py		main.py
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

`xql` - Querying Xarray Datasets with SQL

Supported Features

Quickstart

Prerequisites

Usage

Supported meta commands

Example Queries

Dask Cluster Configuration

Roadmap

`weather-lm` - Querying weather data using Natural Language prompts

Quickstart

Prerequisites

Usage

Examples

FilesExpand file tree

xql

Directory actions

More options

Directory actions

More options

Latest commit

History

xql

Folders and files

parent directory

README.md

xql - Querying Xarray Datasets with SQL

Supported Features

Quickstart

Prerequisites

Usage

Supported meta commands

Example Queries

Dask Cluster Configuration

Roadmap

weather-lm - Querying weather data using Natural Language prompts

Quickstart

Prerequisites

Usage

Examples

`xql` - Querying Xarray Datasets with SQL

`weather-lm` - Querying weather data using Natural Language prompts