Skip to content

[SEDONA-721] Add Sedona vectorized udf for Python#1859

Merged
jiayuasu merged 13 commits into
masterfrom
sedona-arrow-udf-example
Apr 2, 2025
Merged

[SEDONA-721] Add Sedona vectorized udf for Python#1859
jiayuasu merged 13 commits into
masterfrom
sedona-arrow-udf-example

Conversation

@Imbruced

@Imbruced Imbruced commented Mar 16, 2025

Copy link
Copy Markdown
Member

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, SEDONA-721.

What changes were proposed in this PR?

Sedona vectorized geometry udfs (scalar only now)

How was this patch tested?

unit tests

Did this PR include necessary documentation updates?

yes

@Imbruced

Copy link
Copy Markdown
Member Author

need to adjust it before I ll reopen it again

@Imbruced Imbruced reopened this Mar 16, 2025
@Imbruced

Copy link
Copy Markdown
Member Author

need to add docs for this one


val batchIter = if (batchSize > 0) new BatchIterator(iter, batchSize) else Iterator(iter)

val columnarBatchIter = new ArrowPythonRunner(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a battle for this particular PR, but do we get to choose what the Python function is evaluating on or are we leaning on built-in Spark things such that we are forced to have this be a function of a pandas series? (if it could be a function of, for example, two numpy arrays for points or Arrow buffers more generally, it would open up some options in terms of speed).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not super opting this solution as well. I just wanted to unlock the arrow based udf in Sedona. I totally agree that we can do better. Right now based on my internal tests it's 2 times faster than normal udf.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! At some point my Spark/Scala will be good enough to see if there's any room to improve on that 🙂

@Imbruced Imbruced Mar 18, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to help on that 🙇

@Imbruced Imbruced force-pushed the sedona-arrow-udf-example branch from 186feae to 131622a Compare March 22, 2025 22:08
@github-actions github-actions Bot added the docs label Mar 22, 2025
@Imbruced Imbruced marked this pull request as ready for review March 22, 2025 22:23
@Imbruced Imbruced requested a review from jiayuasu as a code owner March 22, 2025 22:23
@Imbruced

Imbruced commented Mar 22, 2025

Copy link
Copy Markdown
Member Author

@jiayuasu, please let me know what you think. Maybe we should turn this off by default and make it experimental?

Comment thread .github/workflows/java.yml Outdated
Comment on lines +63 to +75
for i in range(5):
start = time()
area1 = self.get_area(df, vectorized_buffer)

assert area1 > 478

vectorized_times.append(time() - start)

area2 = self.get_area(df, buffer_distanced_udf)

assert area2 > 478

non_vectorized_times.append(time() - start)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not resetting start before calling buffer_distanced_udf, the non_vectorized_time will be the total time of calling vectorized_buffer and buffer_distanced_udf.

I also wonder if this test could be flaky because the size of dataset is not large enough to exhibit the performance advantage of Arrow UDF.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point; I missed the start here. For sure, for larger datasets, there is an improvement; maybe we don't need the test at all to test the performance. Maybe having benchmark tests only when releasing is a good idea?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Let's remove it from the unit test.

Comment on lines +16 to +23
def non_vectorized_buffer_udf(geom: b.BaseGeometry) -> b.BaseGeometry:
return geom.buffer(0.1)


@sedona_vectorized_udf()
def vectorized_buffer(geom: b.BaseGeometry) -> b.BaseGeometry:
return geom.buffer(0.1)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename vectorized_buffer to vectorized_buffer_udf to be consistent with non_vectorized_buffer_udf.


import scala.collection.mutable

class ExtractSedonaUDFRule extends Rule[LogicalPlan] {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add comments to this class declaring how this rule is different from org.apache.spark.sql.execution.python.ExtractPythonUDFs?


import scala.collection.JavaConverters.asScalaIteratorConverter

//import scala.jdk.CollectionConverters.asScalaIteratorConverter

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented imports.

package org.apache.sedona.sql.UDF

object PythonEvalType {
val SQL_SCALAR_SEDONA_UDF = 5200

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add comment to clarify that 5200 is SEDONA_UDF_TYPE_CONSTANT + SQL_SCALAR_PANDAS_UDF, where SQL_SCALAR_PANDAS_UDF is 200.

Comment on lines +44 to +54
case class SedonaArrowEvalPythonExec(
udfs: Seq[PythonUDF],
resultAttrs: Seq[Attribute],
child: SparkPlan,
evalType: Int)
extends EvalPythonExec
with PythonSQLMetrics {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add comments declaring how it is different from ArrowEvalPythonExec? I can see that we omitted the checks on the output types.

Comment thread docs/tutorial/sql.md
Comment on lines +1224 to +1255
### Shapely scalar UDF

```python
import shapely.geometry.base as b
from sedona.sql.functions import sedona_vectorized_udf, SedonaUDFType

@sedona_vectorized_udf()
def vectorized_buffer(geom: b.BaseGeometry) -> b.BaseGeometry:
return geom.buffer(0.1)
```

### GeoSeries UDF

```python
import geopandas as gpd
from sedona.sql.functions import sedona_vectorized_udf, SedonaUDFType

@sedona_vectorized_udf(udf_type=SedonaUDFType.GEO_SERIES)
def vectorized_geo_series_buffer(series: gpd.GeoSeries) -> gpd.GeoSeries:
buffered = series.buffer(0.1)

return buffered
```

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we support other variants of UDFs involving geometries?

  • UDFs taking geometry as input and returning a numeric value
  • UDFs taking a numeric value and returning a geometry object
  • UDFs that has more than 1 parameters, has geometry as parameter type or return type

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me verify those; my plan is to add new functionalities later, like table functions or agg functions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks I messed up a little, now I fixed it and tested

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is going to work, as more than 1 column is not scala, and I didn't intend to add this in this MR; I am planning to add this later.

@Kontinuation

Copy link
Copy Markdown
Member

I think it is OK to have this PR as a workaround for supporting UDFs involving geometry types. The final solution is to push SPARK-34771 and apache/spark#31735 forward to make Spark supports UDFs involving UDTs generally.

@Imbruced

Copy link
Copy Markdown
Member Author

I think it is OK to have this PR as a workaround for supporting UDFs involving geometry types. The final solution is to push SPARK-34771 and apache/spark#31735 forward to make Spark supports UDFs involving UDTs generally.

Yes, I agree. This MR aims to unlock arrow udfs for geometry type. I thought of moving this forward with custom geopandas or duckdb runners based on Geoarrow. That's why I added new Python functions instead of reusing the existing pandas_udf from spark.


import spark.implicits._

test("Chained Scalar Pandas UDFs should be combined to a single physical node") {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this test? Are we testing the physical plan? I don't see any assertion if it is supposed to test on the physical plan tree node.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sry, I changed the name of the test

* specific language governing permissions and limitations
* under the License.
*/
package org.apache.spark.sql.udf

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why this is put under the spark-3.5 folder? Is it version specific code?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the one only is working with Spark 3.5

@Imbruced Imbruced force-pushed the sedona-arrow-udf-example branch from e2bccb7 to 6277cbd Compare March 29, 2025 16:15
Comment thread docs/tutorial/sql.md Outdated
Let's analyze the two examples below, that creates buffers from
a given geometry.

Make sure

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix the sentence here?

@jiayuasu jiayuasu changed the title Sedona arrow udf example [SEDONA-721] Add Sedona vectorized udf for Python Mar 31, 2025
@jiayuasu jiayuasu added this to the sedona-1.8.0 milestone Mar 31, 2025
@@ -0,0 +1,126 @@
import inspect

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apache File header please

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize that it is not added automatically

@@ -0,0 +1,213 @@
from sedona.sql.types import GeometryType

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apache header please.

@Imbruced Imbruced requested a review from jiayuasu April 1, 2025 20:48
@jiayuasu jiayuasu merged commit 1798df2 into master Apr 2, 2025
@jiayuasu jiayuasu deleted the sedona-arrow-udf-example branch April 3, 2025 06:55
Kontinuation added a commit to Kontinuation/sedona that referenced this pull request Jan 21, 2026
* SEDONA-721 Add Sedona vectorized udf.

* SEDONA-721 Add documentation

* SEDONA-721 Add documentation

* SEDONA-721 Add documentation

* Update .github/workflows/java.yml

Co-authored-by: Kristin Cowalcijk <kontinuation@apache.org>

* SEDONA-721 Apply requested changes.

* SEDONA-721 Apply requested changes.

* SEDONA-721 Apply requested changes.

* SEDONA-721 Apply requested changes.

* SEDONA-721 Apply requested changes.

* SEDONA-721 Apply requested changes.

* SEDONA-721 Apply requested changes.

* SEDONA-721 Apply requested changes.

---------

Co-authored-by: Kristin Cowalcijk <kontinuation@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants