SONARPY-2501 Create rule S7181 PySpark Window functions should always specify a frame (#4614)
github-actions[bot] authored Feb 13, 2025
1 parent e3a3a43 commit a16475c
Showing 3 changed files with 95 additions and 0 deletions.
2 changes: 2 additions & 0 deletions rules/S7181/metadata.json
@@ -0,0 +1,2 @@
{
}
25 changes: 25 additions & 0 deletions rules/S7181/python/metadata.json
@@ -0,0 +1,25 @@
{
"title": "PySpark Window functions should always specify a frame",
"type": "CODE_SMELL",
"status": "ready",
"remediation": {
"func": "Constant\/Issue",
"constantCost": "5min"
},
"tags": [
"data-science",
"pyspark"
],
"defaultSeverity": "Major",
"ruleSpecification": "RSPEC-7181",
"sqKey": "S7181",
"scope": "All",
"defaultQualityProfiles": ["Sonar way"],
"quickfix": "unknown",
"code": {
"impacts": {
"MAINTAINABILITY": "HIGH"
},
"attribute": "CONVENTIONAL"
}
}
68 changes: 68 additions & 0 deletions rules/S7181/python/rule.adoc
@@ -0,0 +1,68 @@
This rule raises an issue when a PySpark `Window` function is used without defining a frame.

== Why is this an issue?

In PySpark, a window defines a set of rows related to the current row, enabling calculations like running totals or rankings across these rows. It is useful for performing complex data analysis tasks by allowing computations over partitions of data while preserving the context of each row.

Depending on the operation you want to compute, you may need to define a frame for the window. A frame defines the range of rows used in each computation. If you don't define a frame, a default frame is used.

The default frame that will be used depends on whether ordering is defined. When ordering is not defined, an unbounded window frame `(rowFrame, unboundedPreceding, unboundedFollowing)` is used by default. When ordering is defined, a growing window frame `(rangeFrame, unboundedPreceding, currentRow)` is used by default.

This can lead to unexpected results if the default frame is not what you intended. It is recommended to always define a frame when using a window function to avoid confusion and ensure the expected results.
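
The default frames described above can be emulated in plain Python to see where the surprises come from. The following is a minimal sketch of the frame semantics for a single partition, not Spark code; the function names and the sample data are hypothetical and chosen only for illustration.

```python
def default_frame_sum(rows, ordered):
    """Emulate sum() over one partition with the default frame.

    rows is a list of (order_key, value) pairs (hypothetical layout).
    """
    if not ordered:
        # No ordering: (rowFrame, unboundedPreceding, unboundedFollowing),
        # so every row sees the whole partition.
        total = sum(value for _, value in rows)
        return [total for _ in rows]
    # Ordering defined: (rangeFrame, unboundedPreceding, currentRow).
    # A range frame ends at the current row's *key*, so rows that tie
    # on the ordering key (peers) are all included.
    ordered_rows = sorted(rows, key=lambda row: row[0])
    return [
        sum(value for key, value in ordered_rows if key <= current_key)
        for current_key, _ in ordered_rows
    ]


def explicit_rows_running_sum(rows):
    """Emulate sum() with an explicit rows frame ending at the current row."""
    running, total = [], 0
    # sorted() is stable, so tied rows keep their input order in this sketch.
    for _, value in sorted(rows, key=lambda row: row[0]):
        total += value
        running.append(total)
    return running


# Two rows share the ordering key 1.
partition = [(1, 100), (1, 200), (2, 300)]

no_order = default_frame_sum(partition, ordered=False)
with_order = default_frame_sum(partition, ordered=True)
rows_frame = explicit_rows_running_sum(partition)

print(no_order)    # [600, 600, 600]
print(with_order)  # [300, 300, 600]
print(rows_frame)  # [100, 300, 600]
```

With tied ordering keys, the default range frame assigns the same sum to both tied rows, whereas a row-based frame produces a strictly growing running total. Writing the frame explicitly makes the intended behavior visible in the code.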

== How to fix it

To fix this issue, explicitly define the frame when using a window function, for instance with the `rowsBetween` or `rangeBetween` methods.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum
spark = SparkSession.builder.appName("Example").getOrCreate()
data = [(1, "A", 100), (2, "A", 200), (3, "B", 300)]
df = spark.createDataFrame(data, ["id", "category", "value"])
window_spec = Window.partitionBy("category").orderBy("id") # Noncompliant: No explicit frame specified
df.withColumn("cumulative_sum", sum("value").over(window_spec)).show()
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum
spark = SparkSession.builder.appName("Example").getOrCreate()
data = [(1, "A", 100), (2, "A", 200), (3, "B", 300)]
df = spark.createDataFrame(data, ["id", "category", "value"])
window_spec = Window.partitionBy("category").orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow) # Compliant: Explicit frame specified
df.withColumn("cumulative_sum", sum("value").over(window_spec)).show()
----

== Resources
=== Documentation
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.html[pyspark.sql.Window]

ifdef::env-github,rspecator-view[]

== Implementation Specification
(visible only on this page)

=== Message

Specify a frame for this PySpark Window function.

endif::env-github,rspecator-view[]
