[SEDONA-721] Add Sedona vectorized udf for Python by Imbruced · Pull Request #1859 · apache/sedona

Imbruced · 2025-03-16T16:04:16Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a ticket?

Yes, SEDONA-721.

What changes were proposed in this PR?

Sedona vectorized geometry UDFs (scalar only for now)

How was this patch tested?

unit tests

Did this PR include necessary documentation updates?

yes

Imbruced · 2025-03-16T16:31:57Z

I need to adjust it before I reopen it.

Imbruced · 2025-03-17T22:06:47Z

need to add docs for this one

paleolimbot · 2025-03-18T19:33:22Z

+
+    val batchIter = if (batchSize > 0) new BatchIterator(iter, batchSize) else Iterator(iter)
+
+    val columnarBatchIter = new ArrowPythonRunner(


Not a battle for this particular PR, but do we get to choose what the Python function is evaluating on or are we leaning on built-in Spark things such that we are forced to have this be a function of a pandas series? (if it could be a function of, for example, two numpy arrays for points or Arrow buffers more generally, it would open up some options in terms of speed).

I'm not strongly attached to this solution either. I just wanted to unlock Arrow-based UDFs in Sedona. I totally agree that we can do better. Right now, based on my internal tests, it's 2 times faster than a normal UDF.

Awesome! At some point my Spark/Scala will be good enough to see if there's any room to improve on that 🙂

I would like to help on that 🙇
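For readers less familiar with the Scala hunk above, the batching expression can be read as: group the incoming rows into batches of `batchSize` when it is positive, otherwise pass all rows through as a single batch. A minimal Python sketch of that behavior follows; the `batches` helper is hypothetical and is not part of Sedona or Spark.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield rows in lists of at most batch_size items.

    Hypothetical helper mirroring the Scala expression in the hunk:
    a non-positive batch_size means "one batch containing everything".
    """
    it = iter(rows)
    if batch_size <= 0:
        # Equivalent of Iterator(iter): a single batch with all rows.
        yield list(it)
        return
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


batched = list(batches(range(5), 2))
unbatched = list(batches(range(3), 0))
```

Each batch is what would then be serialized to Arrow and handed to the Python worker, so the batch size bounds how many rows are materialized at once.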

Imbruced · 2025-03-22T22:25:52Z

@jiayuasu, please let me know what you think. Maybe we should turn this off by default and make it experimental?

Kontinuation · 2025-03-24T01:52:39Z

+        for i in range(5):
+            start = time()
+            area1 = self.get_area(df, vectorized_buffer)
+
+            assert area1 > 478
+
+            vectorized_times.append(time() - start)
+
+            area2 = self.get_area(df, buffer_distanced_udf)
+
+            assert area2 > 478
+
+            non_vectorized_times.append(time() - start)


We are not resetting start before calling buffer_distanced_udf, so the non-vectorized time will be the total time of calling both vectorized_buffer and buffer_distanced_udf.

I also wonder if this test could be flaky, because the dataset may not be large enough to exhibit the performance advantage of the Arrow UDF.

Yeah, that's a good point; I missed the start here. For sure, for larger datasets, there is an improvement; maybe we don't need the test at all to test the performance. Maybe having benchmark tests only when releasing is a good idea?

Agreed. Let's remove it from the unit test.
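For reference, the timing bug discussed in this thread goes away if each call gets its own start time. The sketch below is only an illustration: `timed`, `fast_path`, and `slow_path` are hypothetical stand-ins, not the test's actual helpers or UDFs.

```python
from time import perf_counter


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds) for this call only."""
    start = perf_counter()
    result = fn(*args)
    return result, perf_counter() - start


# Hypothetical stand-ins for the vectorized and non-vectorized pipelines.
def fast_path(n):
    return sum(range(n))


def slow_path(n):
    total = 0
    for i in range(n):
        total += i
    return total


vectorized_times, non_vectorized_times = [], []
for _ in range(5):
    # Each measurement starts its own clock, so neither includes the other.
    area1, elapsed1 = timed(fast_path, 10_000)
    vectorized_times.append(elapsed1)

    area2, elapsed2 = timed(slow_path, 10_000)
    non_vectorized_times.append(elapsed2)

    assert area1 == area2
```

Even with correct timing, comparing wall-clock numbers in a unit test stays sensitive to machine load, which is consistent with the decision above to drop the check.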

Kontinuation · 2025-03-24T01:56:08Z

+def non_vectorized_buffer_udf(geom: b.BaseGeometry) -> b.BaseGeometry:
+    return geom.buffer(0.1)
+
+
+@sedona_vectorized_udf()
+def vectorized_buffer(geom: b.BaseGeometry) -> b.BaseGeometry:
+    return geom.buffer(0.1)


Rename vectorized_buffer to vectorized_buffer_udf to be consistent with non_vectorized_buffer_udf.

Kontinuation · 2025-03-24T02:05:03Z

+
+import scala.collection.mutable
+
+class ExtractSedonaUDFRule extends Rule[LogicalPlan] {


Can you add comments to this class declaring how this rule is different from org.apache.spark.sql.execution.python.ExtractPythonUDFs?

Kontinuation · 2025-03-24T02:06:56Z

+
+import scala.collection.JavaConverters.asScalaIteratorConverter
+
+//import scala.jdk.CollectionConverters.asScalaIteratorConverter


Remove commented imports.

Kontinuation · 2025-03-24T02:12:40Z

+package org.apache.sedona.sql.UDF
+
+object PythonEvalType {
+  val SQL_SCALAR_SEDONA_UDF = 5200


Let's add a comment to clarify that 5200 is SEDONA_UDF_TYPE_CONSTANT + SQL_SCALAR_PANDAS_UDF, where SQL_SCALAR_PANDAS_UDF is 200.
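The arithmetic behind the requested comment can be written out as below. This is a sketch rather than the PR's code: the value 5000 for `SEDONA_UDF_TYPE_CONSTANT` is an assumption inferred from 5200 - 200, since the review comment names the constant without stating its value.

```python
# Spark's eval type for scalar pandas UDFs, as stated in the review comment.
SQL_SCALAR_PANDAS_UDF = 200

# Assumed value, inferred as 5200 - 200; not quoted anywhere in this thread.
SEDONA_UDF_TYPE_CONSTANT = 5000

# The Sedona eval type is the Spark eval type shifted by the Sedona offset.
SQL_SCALAR_SEDONA_UDF = SEDONA_UDF_TYPE_CONSTANT + SQL_SCALAR_PANDAS_UDF
```

Deriving the value from the two named constants, instead of hard-coding 5200, keeps the relationship to Spark's own eval type visible in the source.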

Kontinuation · 2025-03-24T02:17:17Z

+case class SedonaArrowEvalPythonExec(
+    udfs: Seq[PythonUDF],
+    resultAttrs: Seq[Attribute],
+    child: SparkPlan,
+    evalType: Int)
+    extends EvalPythonExec
+    with PythonSQLMetrics {


Can you add comments declaring how it is different from ArrowEvalPythonExec? I can see that we omitted the checks on the output types.

Kontinuation · 2025-03-24T02:31:42Z

+### Shapely scalar UDF
+
+```python
+import shapely.geometry.base as b
+from sedona.sql.functions import sedona_vectorized_udf, SedonaUDFType
+
+@sedona_vectorized_udf()
+def vectorized_buffer(geom: b.BaseGeometry) -> b.BaseGeometry:
+    return geom.buffer(0.1)
+```
+
+### GeoSeries UDF
+
+```python
+import geopandas as gpd
+from sedona.sql.functions import sedona_vectorized_udf, SedonaUDFType
+
+@sedona_vectorized_udf(udf_type=SedonaUDFType.GEO_SERIES)
+def vectorized_geo_series_buffer(series: gpd.GeoSeries) -> gpd.GeoSeries:
+    buffered = series.buffer(0.1)
+
+    return buffered
+```


Do we support other variants of UDFs involving geometries?

* UDFs taking a geometry as input and returning a numeric value
* UDFs taking a numeric value and returning a geometry object
* UDFs that have more than one parameter, with geometry as a parameter type or return type

Let me verify those; my plan is to add new functionalities later, like table functions or agg functions.

Looks like I messed up a little; I have now fixed it and tested it.

I am not sure if this is going to work, as more than 1 column is not scalar, and I didn't intend to add this in this MR; I am planning to add this later.

Kontinuation · 2025-03-24T02:40:49Z

I think it is OK to have this PR as a workaround for supporting UDFs involving geometry types. The final solution is to push SPARK-34771 and apache/spark#31735 forward to make Spark support UDFs involving UDTs generally.

Imbruced · 2025-03-24T12:01:22Z

I think it is OK to have this PR as a workaround for supporting UDFs involving geometry types. The final solution is to push SPARK-34771 and apache/spark#31735 forward to make Spark support UDFs involving UDTs generally.

Yes, I agree. This MR aims to unlock Arrow UDFs for the geometry type. I thought of moving this forward with custom GeoPandas or DuckDB runners based on GeoArrow. That's why I added new Python functions instead of reusing the existing pandas_udf from Spark.

zhangfengcdt · 2025-03-24T20:29:10Z

+
+  import spark.implicits._
+
+  test("Chained Scalar Pandas UDFs should be combined to a single physical node") {


What is the purpose of this test? Are we testing the physical plan? I don't see any assertion if it is supposed to test on the physical plan tree node.

Sorry, I changed the name of the test.

zhangfengcdt · 2025-03-24T20:30:47Z

+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.spark.sql.udf


Any reason why this is put under the spark-3.5 folder? Is it version specific code?

Yes, this one only works with Spark 3.5.

Co-authored-by: Kristin Cowalcijk <kontinuation@apache.org>

jiayuasu · 2025-03-31T04:53:59Z

+Let's analyze the two examples below, that creates buffers from
+a given geometry.
+
+Make sure


Can you fix the sentence here?

jiayuasu · 2025-03-31T23:13:51Z

@@ -0,0 +1,126 @@
+import inspect


Apache File header please

I didn't realize that it is not added automatically

jiayuasu · 2025-03-31T23:14:51Z

@@ -0,0 +1,213 @@
+from sedona.sql.types import GeometryType


Apache header please.

* SEDONA-721 Add Sedona vectorized udf.
* SEDONA-721 Add documentation
* SEDONA-721 Add documentation
* SEDONA-721 Add documentation
* Update .github/workflows/java.yml

  Co-authored-by: Kristin Cowalcijk <kontinuation@apache.org>
* SEDONA-721 Apply requested changes.
* SEDONA-721 Apply requested changes.
* SEDONA-721 Apply requested changes.
* SEDONA-721 Apply requested changes.
* SEDONA-721 Apply requested changes.
* SEDONA-721 Apply requested changes.
* SEDONA-721 Apply requested changes.
* SEDONA-721 Apply requested changes.

---------

Co-authored-by: Kristin Cowalcijk <kontinuation@apache.org>

github-actions Bot added sedona-python github-actions sedona-spark labels Mar 16, 2025

Imbruced closed this Mar 16, 2025

Imbruced reopened this Mar 16, 2025

paleolimbot reviewed Mar 18, 2025

Imbruced force-pushed the sedona-arrow-udf-example branch from 186feae to 131622a Compare March 22, 2025 22:08

github-actions Bot added the docs label Mar 22, 2025

Imbruced marked this pull request as ready for review March 22, 2025 22:23

Imbruced requested a review from jiayuasu as a code owner March 22, 2025 22:23

jiayuasu requested review from Kontinuation and zhangfengcdt March 22, 2025 22:26

Kontinuation reviewed Mar 24, 2025

zhangfengcdt reviewed Mar 24, 2025

Imbruced and others added 7 commits March 29, 2025 16:36

SEDONA-721 Add Sedona vectorized udf.

286a610

SEDONA-721 Add documentation

f281ede

SEDONA-721 Add documentation

932c8ad

SEDONA-721 Add documentation

2749a74

Update .github/workflows/java.yml

264305a

Co-authored-by: Kristin Cowalcijk <kontinuation@apache.org>

SEDONA-721 Apply requested changes.

d5b9f5f

SEDONA-721 Apply requested changes.

6277cbd

Imbruced force-pushed the sedona-arrow-udf-example branch from e2bccb7 to 6277cbd Compare March 29, 2025 16:15

Imbruced added 3 commits March 29, 2025 17:40

SEDONA-721 Apply requested changes.

95e129a

SEDONA-721 Apply requested changes.

bb0a8fb

SEDONA-721 Apply requested changes.

f1d468d

SEDONA-721 Apply requested changes.

7219c81

jiayuasu reviewed Mar 31, 2025

SEDONA-721 Apply requested changes.

e1a32d7

jiayuasu changed the title ~~Sedona arrow udf example~~ [SEDONA-721] Add Sedona vectorized udf for Python Mar 31, 2025

jiayuasu added this to the sedona-1.8.0 milestone Mar 31, 2025

jiayuasu added the improvement label Mar 31, 2025

Imbruced requested review from Kontinuation, jiayuasu and zhangfengcdt March 31, 2025 17:02

jiayuasu approved these changes Mar 31, 2025

jiayuasu reviewed Mar 31, 2025

jiayuasu requested changes Mar 31, 2025

SEDONA-721 Apply requested changes.

9b34875

Imbruced requested a review from jiayuasu April 1, 2025 20:48

jiayuasu approved these changes Apr 2, 2025

jiayuasu merged commit 1798df2 into master Apr 2, 2025

jiayuasu deleted the sedona-arrow-udf-example branch April 3, 2025 06:55


5 participants
