Spark SQL functions.scala 源码二Aggregate functions Spark 3.3.0

Shockang

2024-05-06 帮助1人

前言

本文隶属于专栏《1000个问题搞定大数据技术体系》，该专栏为笔者原创，引用请注明来源，不足和错误之处请在评论区帮忙指出，谢谢！

本专栏目录结构和参考文献请见1000个问题搞定大数据技术体系

关联

Spark SQL 内置函数（五）Aggregate Functions（基于 Spark 3.2.0）

正文

approx_count_distinct

  /**
   * @group agg_funcs
   * @since 1.3.0
   */
  @deprecated("Use approx_count_distinct", "2.1.0")
  def approxCountDistinct(e: Column): Column = approx_count_distinct(e)

  /**
   * @group agg_funcs
   * @since 1.3.0
   */
  @deprecated("Use approx_count_distinct", "2.1.0")
  def approxCountDistinct(columnName: String): Column = approx_count_distinct(columnName)

  /**
   * @group agg_funcs
   * @since 1.3.0
   */
  @deprecated("Use approx_count_distinct", "2.1.0")
  def approxCountDistinct(e: Column, rsd: Double): Column = approx_count_distinct(e, rsd)

  /**
   * @group agg_funcs
   * @since 1.3.0
   */
  @deprecated("Use approx_count_distinct", "2.1.0")
  def approxCountDistinct(columnName: String, rsd: Double): Column = {
    approx_count_distinct(Column(columnName), rsd)
  }

  /**
   * 聚合函数：返回一组数据中不同项目的大致数量。
   *
   * @group agg_funcs
   * @since 2.1.0
   */
  def approx_count_distinct(e: Column): Column = withAggregateFunction {
    HyperLogLogPlusPlus(e.expr)
  }

  /**
   * 聚合函数：返回一组数据中不同项目的大致数量。
   *
   * @group agg_funcs
   * @since 2.1.0
   */
  def approx_count_distinct(columnName: String): Column = approx_count_distinct(column(columnName))

  /**
   * 聚合函数：返回一组数据中不同项目的大致数量。
   *
   * @param rsd maximum 允许的最大相对标准偏差（默认值 = 0.05）
   *
   * @group agg_funcs
   * @since 2.1.0
   */
  def approx_count_distinct(e: Column, rsd: Double): Column = withAggregateFunction {
    HyperLogLogPlusPlus(e.expr, rsd, 0, 0)
  }

  /**
   *聚合函数：返回一组数据中不同项目的大致数量。
   *
   * @param rsd maximum 允许的最大相对标准偏差（默认值 = 0.05）
   *
   * @group agg_funcs
   * @since 2.1.0
   */
  def approx_count_distinct(columnName: String, rsd: Double): Column = {
    approx_count_distinct(Column(columnName), rsd)
  }

用法

========== df.select(approxCountDistinct(col("age"))) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approxCountDistinct("age")) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approxCountDistinct(col("age"), 0.05)) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approxCountDistinct("age", 0.05)) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approx_count_distinct(col("age"))) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approx_count_distinct("age")) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approx_count_distinct(col("age"), 0.05)) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approx_count_distinct("age", 0.05)) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 --------------------------

avg

  /**
   * 聚合函数：返回一组值中的平均值。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def avg(e: Column): Column = withAggregateFunction { Average(e.expr) }

  /**
   * 聚合函数：返回一组值中的平均值。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def avg(columnName: String): Column = avg(Column(columnName))

用法

========== df.select(avg(col("age"))) ==========
 ------------------ 
|          avg(age)|
 ------------------ 
|29.166666666666668|
 ------------------ 

========== df.select(avg("age")) ==========
 ------------------ 
|          avg(age)|
 ------------------ 
|29.166666666666668|
 ------------------

collect_list/collect_set

  /**
   * 聚合函数：返回具有重复项的对象列表。
   * 注意：该函数是不确定的，因为收集结果的顺序取决于行的顺序，这在 shuffle 之后可能是不确定的。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def collect_list(e: Column): Column = withAggregateFunction { CollectList(e.expr) }

  /**
   * 聚合函数：返回具有重复项的对象列表。
   * 注意：该函数是不确定的，因为收集结果的顺序取决于行的顺序，这在 shuffle 之后可能是不确定的。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def collect_list(columnName: String): Column = collect_list(Column(columnName))

  /**
   * 聚合函数：返回消除了重复项的对象列表。
   * 注意：该函数是不确定的，因为收集结果的顺序取决于行的顺序，这在 shuffle 之后可能是不确定的。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def collect_set(e: Column): Column = withAggregateFunction { CollectSet(e.expr) }

  /**
   * 聚合函数：返回消除了重复项的对象列表。
   * 注意：该函数是不确定的，因为收集结果的顺序取决于行的顺序，这在 shuffle 之后可能是不确定的。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def collect_set(columnName: String): Column = collect_set(Column(columnName))

用法

========== df.select(collect_list(col("age"))) ==========
 -------------------- 
|   collect_list(age)|
 -------------------- 
|[30, 28, 24, 34, ...|
 -------------------- 

========== df.select(collect_list("age")) ==========
 -------------------- 
|   collect_list(age)|
 -------------------- 
|[30, 28, 24, 34, ...|
 -------------------- 

========== df.select(collect_set(col("age"))) ==========
 -------------------- 
|    collect_set(age)|
 -------------------- 
|[30, 24, 28, 34, 35]|
 -------------------- 

========== df.select(collect_set("age")) ==========
 -------------------- 
|    collect_set(age)|
 -------------------- 
|[30, 24, 28, 34, 35]|
 --------------------

corr

  /**
   * 聚合函数：返回两列的皮尔逊相关系数。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def corr(column1: Column, column2: Column): Column = withAggregateFunction {
    Corr(column1.expr, column2.expr)
  }

  /**
   * 聚合函数：返回两列的皮尔逊相关系数。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def corr(columnName1: String, columnName2: String): Column = {
    corr(Column(columnName1), Column(columnName2))
  }

做相似度计算的时候经常会用到皮尔逊相关系数（Pearson Correlation Coefficient）。
皮尔逊系数可以看作两组数据的向量夹角的余弦。

用法

========== df.select(corr(col("age"), col("salary"))) ==========
 ------------------ 
| corr(age, salary)|
 ------------------ 
|0.9789297418835571|
 ------------------ 

========== df.select(corr("age", "salary")) ==========
 ------------------ 
| corr(age, salary)|
 ------------------ 
|0.9789297418835571|
 ------------------

count/count_distinct

  /**
   * 聚合函数：返回一组中的项目数。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def count(e: Column): Column = withAggregateFunction {
    e.expr match {
      // Turn count(*) into count(1)
      case s: Star => Count(Literal(1))
      case _ => Count(e.expr)
    }
  }

  /**
   * 聚合函数：返回一组中的项目数。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def count(columnName: String): TypedColumn[Any, Long] =
    count(Column(columnName)).as(ExpressionEncoder[Long]())

  /**
   * 聚合函数：返回一组中不同项目的数量。
   *
   * 又称 `count_distinct`，鼓励直接使用 `count_distinct` 。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  @scala.annotation.varargs
  def countDistinct(expr: Column, exprs: Column*): Column = count_distinct(expr, exprs: _*)

  /**
   * 聚合函数：返回一组中不同项目的数量。
   *
   * 又称 `count_distinct`，鼓励直接使用 `count_distinct` 。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  @scala.annotation.varargs
  def countDistinct(columnName: String, columnNames: String*): Column =
    count_distinct(Column(columnName), columnNames.map(Column.apply) : _*)

  /**
   * 聚合函数：返回一组中不同项目的数量。
   *
   * @group agg_funcs
   * @since 3.2.0
   */
  @scala.annotation.varargs
  def count_distinct(expr: Column, exprs: Column*): Column =
    // 对于像countDistinct("")这样的用法，我们应该让分析器扩展star和resolve函数。
    Column(UnresolvedFunction("count", (expr  : exprs).map(_.expr), isDistinct = true))

用法

========== df.select(count(col("age"))) ==========
 ---------- 
|count(age)|
 ---------- 
|         6|
 ---------- 

========== df.select(count("age")) ==========
 ---------- 
|count(age)|
 ---------- 
|         6|
 ---------- 

========== df.select(countDistinct(col("age"))) ==========
 ------------------- 
|count(DISTINCT age)|
 ------------------- 
|                  5|
 ------------------- 

========== df.select(countDistinct("age")) ==========
 ------------------- 
|count(DISTINCT age)|
 ------------------- 
|                  5|
 ------------------- 

========== df.select(count_distinct(col("age"))) ==========
 ------------------- 
|count(DISTINCT age)|
 ------------------- 
|                  5|
 ------------------- 

========== df.select(count_distinct(col("age"), col("salary"))) ==========
 --------------------------- 
|count(DISTINCT age, salary)|
 --------------------------- 
|                          6|
 ---------------------------

covar_pop/covar_samp

  /**
   * 聚合函数：返回两列的总体协方差。
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def covar_pop(column1: Column, column2: Column): Column = withAggregateFunction {
    CovPopulation(column1.expr, column2.expr)
  }

  /**
   *  聚合函数：返回两列的总体协方差。
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def covar_pop(columnName1: String, columnName2: String): Column = {
    covar_pop(Column(columnName1), Column(columnName2))
  }

  /**
   *  聚合函数：返回两列的样本协方差。
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def covar_samp(column1: Column, column2: Column): Column = withAggregateFunction {
    CovSample(column1.expr, column2.expr)
  }

  /**
   * 聚合函数：返回两列的样本协方差。
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def covar_samp(columnName1: String, columnName2: String): Column = {
    covar_samp(Column(columnName1), Column(columnName2))
  }

用法

========== df.select(covar_pop(col("age"), col("salary"))) ==========
 ---------------------- 
|covar_pop(age, salary)|
 ---------------------- 
|                7250.0|
 ---------------------- 

========== df.select(covar_pop("age", "salary")) ==========
 ---------------------- 
|covar_pop(age, salary)|
 ---------------------- 
|                7250.0|
 ---------------------- 

========== df.select(covar_samp(col("age"), col("salary"))) ==========
 ----------------------- 
|covar_samp(age, salary)|
 ----------------------- 
|                 8700.0|
 ----------------------- 

========== df.select(covar_samp("age", "salary")) ==========
 ----------------------- 
|covar_samp(age, salary)|
 ----------------------- 
|                 8700.0|
 -----------------------

first

  /**
   * 聚合函数：返回一组中的第一个值。
   * 
   * 默认情况下，该函数返回它看到的第一个值。 当 ignoreNulls 设置为 true 时，它将返回它看到的第一个非空
   * 值。 如果所有值都为空，则返回空值。
   * 注意：
   * 该函数是不确定的，因为它的结果取决于行的顺序，这在 shuffle 之后可能是不确定的。
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def first(e: Column, ignoreNulls: Boolean): Column = withAggregateFunction {
    First(e.expr, ignoreNulls)
  }

  /**
   * 聚合函数：返回一组中的第一个值。
   * 
   * 默认情况下，该函数返回它看到的第一个值。 当 ignoreNulls 设置为 true 时，它将返回它看到的第一个非空
   * 值。 如果所有值都为空，则返回空值。
   * 注意：
   * 该函数是不确定的，因为它的结果取决于行的顺序，这在 shuffle 之后可能是不确定的。
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def first(columnName: String, ignoreNulls: Boolean): Column = {
    first(Column(columnName), ignoreNulls)
  }

  /**
   * 聚合函数：返回一组中的第一个值。
   * 
   * 默认情况下，该函数返回它看到的第一个值。 当 ignoreNulls 设置为 true 时，它将返回它看到的第一个非空
   * 值。 如果所有值都为空，则返回空值。
   * 注意：
   * 该函数是不确定的，因为它的结果取决于行的顺序，这在 shuffle 之后可能是不确定的。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def first(e: Column): Column = first(e, ignoreNulls = false)

  /**
   * 聚合函数：返回一组中的第一个值。
   * 
   * 默认情况下，该函数返回它看到的第一个值。 当 ignoreNulls 设置为 true 时，它将返回它看到的第一个非空
   * 值。 如果所有值都为空，则返回空值。
   * 注意：
   * 该函数是不确定的，因为它的结果取决于行的顺序，这在 shuffle 之后可能是不确定的。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def first(columnName: String): Column = first(Column(columnName))

用法

========== df.select(first(col("age"), ignoreNulls = true)) ==========
 ---------- 
|first(age)|
 ---------- 
|        30|
 ---------- 

========== df.select(first("age", ignoreNulls = true)) ==========
 ---------- 
|first(age)|
 ---------- 
|        30|
 ---------- 

========== df.select(first(col("age"))) ==========
 ---------- 
|first(age)|
 ---------- 
|        30|
 ---------- 

========== df.select(first("age")) ==========
 ---------- 
|first(age)|
 ---------- 
|        30|
 ----------

grouping/grouping_id

  /**
   * 聚合函数：指示是否聚合GROUP BY列表中的指定列，返回1表示聚合，返回0表示结果集中未聚合。
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def grouping(e: Column): Column = Column(Grouping(e.expr))

  /**
   * 聚合函数：指示是否聚合GROUP BY列表中的指定列，返回1表示聚合，返回0表示结果集中未聚合。
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def grouping(columnName: String): Column = grouping(Column(columnName))

  /**
   * 聚合函数：返回分组级别，等于
   *
   * {{{
   *   (grouping(c1) <<; (n-1))   (grouping(c2) <<; (n-2))   ...   grouping(cn)
   * }}}
   *
   * @注意 列的列表应与分组列完全匹配，或为空（表示所有分组列）。
   * 
   * grouping columns).
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def grouping_id(cols: Column*): Column = Column(GroupingID(cols.map(_.expr)))

  /**
   * 聚合函数：返回分组级别，等于
   *
   * {{{
   *   (grouping(c1) <<; (n-1))   (grouping(c2) <<; (n-2))   ...   grouping(cn)
   * }}}
   *
   * @注意 列的列表应与分组列完全匹配，或为空（表示所有分组列）。
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def grouping_id(colName: String, colNames: String*): Column = {
    grouping_id((Seq(colName)    colNames).map(n => Column(n)) : _*)
  }

用法

========== df.cube("age", "salary").agg(grouping(col("age")), grouping(col("salary")), grouping_id(col("age"), col("salary"))) ==========
 ---- ------ ------------- ---------------- ------------------------ 
| age|salary|grouping(age)|grouping(salary)|grouping_id(age, salary)|
 ---- ------ ------------- ---------------- ------------------------ 
|null|  null|            0|               0|                       0|
|  24|  5000|            0|               0|                       0|
|  28|  7000|            0|               0|                       0|
|null|  5000|            1|               0|                       2|
|null|  7000|            1|               0|                       2|
|  24|  6000|            0|               0|                       0|
|null| 10000|            1|               0|                       2|
|null|  6000|            1|               0|                       2|
|  30|  null|            0|               1|                       1|
|null|  9000|            1|               0|                       2|
|  30|  8000|            0|               0|                       0|
|null|  null|            0|               1|                       1|
|  34|  null|            0|               1|                       1|
|  34|  9000|            0|               0|                       0|
|null|  null|            1|               1|                       3|
|  35|  null|            0|               1|                       1|
|null|  null|            1|               0|                       2|
|  24|  null|            0|               1|                       1|
|  28|  null|            0|               1|                       1|
|null|  8000|            1|               0|                       2|
 ---- ------ ------------- ---------------- ------------------------ 
only showing top 20 rows

========== df.cube("age", "salary").agg(grouping("age"), grouping("salary"), grouping_id("age", "salary")) ==========
 ---- ------ ------------- ---------------- ------------------------ 
| age|salary|grouping(age)|grouping(salary)|grouping_id(age, salary)|
 ---- ------ ------------- ---------------- ------------------------ 
|null|  null|            0|               0|                       0|
|  24|  5000|            0|               0|                       0|
|  28|  7000|            0|               0|                       0|
|null|  5000|            1|               0|                       2|
|null|  7000|            1|               0|                       2|
|  24|  6000|            0|               0|                       0|
|null| 10000|            1|               0|                       2|
|null|  6000|            1|               0|                       2|
|  30|  null|            0|               1|                       1|
|null|  9000|            1|               0|                       2|
|  30|  8000|            0|               0|                       0|
|null|  null|            0|               1|                       1|
|  34|  null|            0|               1|                       1|
|  34|  9000|            0|               0|                       0|
|null|  null|            1|               1|                       3|
|  35|  null|            0|               1|                       1|
|null|  null|            1|               0|                       2|
|  24|  null|            0|               1|                       1|
|  28|  null|            0|               1|                       1|
|null|  8000|            1|               0|                       2|
 ---- ------ ------------- ---------------- ------------------------ 
only showing top 20 rows

kurtosis

  /**
   * 聚合函数：返回组中值的峰度。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def kurtosis(e: Column): Column = withAggregateFunction { Kurtosis(e.expr) }

  /**
   * 聚合函数：返回组中值的峰度。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def kurtosis(columnName: String): Column = kurtosis(Column(columnName))

峰度（peakedness;kurtosis）又称峰态系数。表征概率密度分布曲线在平均值处峰值高低的特征数。直观看来，峰度反映了峰部的尖度。样本的峰度是和正态分布相比较而言统计量，如果峰度大于三，峰的形状比较尖，比正态分布峰要陡峭。反之亦然。
在统计学中，峰度（Kurtosis）衡量实数随机变量概率分布的峰态。峰度高就意味着方差增大是由低频度的大于或小于平均值的极端差值引起的。

用法

========== df.select(kurtosis(col("age"))) ==========
 ------------------- 
|      kurtosis(age)|
 ------------------- 
|-1.5243591393954996|
 ------------------- 

========== df.select(kurtosis("age")) ==========
 ------------------- 
|      kurtosis(age)|
 ------------------- 
|-1.5243591393954996|
 -------------------

last

  /**
   * 聚合函数：返回组中的最后一个值。 
   * 默认情况下，该函数返回它看到的最后一个值。
   * 当ignoreNulls设置为true时，它将返回最后一个看到的非null值。
   * 如果所有值都为null，则返回null。 
   * 注意: 该函数是不确定的，因为它的结果取决于行的顺序，而行的顺序在 Shuffle 后可能是不确定的。
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def last(e: Column, ignoreNulls: Boolean): Column = withAggregateFunction {
    new Last(e.expr, ignoreNulls)
  }

  /**
   * 聚合函数：返回组中的最后一个值。 
   * 默认情况下，该函数返回它看到的最后一个值。
   * 当ignoreNulls设置为true时，它将返回最后一个看到的非null值。
   * 如果所有值都为null，则返回null。 
   * 注意: 该函数是不确定的，因为它的结果取决于行的顺序，而行的顺序在 Shuffle 后可能是不确定的。
   *
   * @group agg_funcs
   * @since 2.0.0
   */
  def last(columnName: String, ignoreNulls: Boolean): Column = {
    last(Column(columnName), ignoreNulls)
  }

  /**
   * 聚合函数：返回组中的最后一个值。 
   * 默认情况下，该函数返回它看到的最后一个值。
   * 当ignoreNulls设置为true时，它将返回最后一个看到的非null值。
   * 如果所有值都为null，则返回null。 
   * 注意: 该函数是不确定的，因为它的结果取决于行的顺序，而行的顺序在 Shuffle 后可能是不确定的。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def last(e: Column): Column = last(e, ignoreNulls = false)

  /**
   *    * 聚合函数：返回组中的最后一个值。 
   * 默认情况下，该函数返回它看到的最后一个值。
   * 当ignoreNulls设置为true时，它将返回最后一个看到的非null值。
   * 如果所有值都为null，则返回null。 
   * 注意: 该函数是不确定的，因为它的结果取决于行的顺序，而行的顺序在 Shuffle 后可能是不确定的。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def last(columnName: String): Column = last(Column(columnName), ignoreNulls = false)

用法

========== df.select(last(col("age"), ignoreNulls = true)) ==========
 --------- 
|last(age)|
 --------- 
|       24|
 --------- 

========== df.select(last("age", ignoreNulls = true)) ==========
 --------- 
|last(age)|
 --------- 
|       24|
 --------- 

========== df.select(last(col("age"))) ==========
 --------- 
|last(age)|
 --------- 
|       24|
 --------- 

========== df.select(last("age")) ==========
 --------- 
|last(age)|
 --------- 
|       24|
 ---------

max/max_by/mean/min/min_by

  /**
   * 聚合函数：返回组中表达式的最大值。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def max(e: Column): Column = withAggregateFunction { Max(e.expr) }

  /**
   * 聚合函数：返回组中表达式的最大值。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def max(columnName: String): Column = max(Column(columnName))

  /**
   * 聚合函数：返回与ord最大值关联的值。
   *
   * @group agg_funcs
   * @since 3.3.0
   */
  def max_by(e: Column, ord: Column): Column = withAggregateFunction { MaxBy(e.expr, ord.expr) }

  /**
   * 聚合函数：返回组中值的平均值。
   * "avg" 的别名。
   *
   * @group agg_funcs
   * @since 1.4.0
   */
  def mean(e: Column): Column = avg(e)

  /**
   * 聚合函数：返回组中值的平均值。
   * "avg" 的别名。
   *
   * @group agg_funcs
   * @since 1.4.0
   */
  def mean(columnName: String): Column = avg(columnName)

  /**
   * 聚合函数：返回组中表达式的最小值。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def min(e: Column): Column = withAggregateFunction { Min(e.expr) }

  /**
   * 聚合函数：返回组中表达式的最小值。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def min(columnName: String): Column = min(Column(columnName))

  /**
   * 聚合函数：返回与ord最小值关联的值。
   *
   * @group agg_funcs
   * @since 3.3.0
   */
  def min_by(e: Column, ord: Column): Column = withAggregateFunction { MinBy(e.expr, ord.expr) }

用法

========== df.select(max(col("age"))) ==========
 -------- 
|max(age)|
 -------- 
|      35|
 -------- 

========== df.select(max("age")) ==========
 -------- 
|max(age)|
 -------- 
|      35|
 -------- 

========== df.select(mean(col("age"))) ==========
 ------------------ 
|          avg(age)|
 ------------------ 
|29.166666666666668|
 ------------------ 

========== df.select(mean("age")) ==========
 ------------------ 
|          avg(age)|
 ------------------ 
|29.166666666666668|
 ------------------ 

========== df.select(min(col("age"))) ==========
 -------- 
|min(age)|
 -------- 
|      24|
 -------- 

========== df.select(min("age")) ==========
 -------- 
|min(age)|
 -------- 
|      24|
 --------

percentile_approx

  /**
   * 聚合函数：返回数值列'col'的近似百分比，该数值列'col'是有序'col'值（从最小值到最大值排序）中的最
   * 小值，因此col值小于或等于该值的百分比不超过'percentage'。 
   * 如果'percentage'是数组，则每个值必须介于0.0和1.0之间。
   * 如果是单浮点值，则必须介于0.0和1.0之间。 
   * 精度参数'accuracy'是一个正数值文字，它以内存为代价控制近似精度。
   * 精度值越高，精度越好，1.0/accuracy 是近似值的相对误差。
   *
   * @group agg_funcs
   * @since 3.1.0
   */
  def percentile_approx(e: Column, percentage: Column, accuracy: Column): Column = {
    withAggregateFunction {
      new ApproximatePercentile(
        e.expr, percentage.expr, accuracy.expr
      )
    }
  }

用法

========== df.select(percentile_approx(col("age"), lit(0.5), lit(50))) ==========
 ------------------------------- 
|percentile_approx(age, 0.5, 50)|
 ------------------------------- 
|                           28.0|
 -------------------------------

product

  /**
   * 聚合函数：返回组中所有数值元素的乘积。
   *
   * @group agg_funcs
   * @since 3.2.0
   */
  def product(e: Column): Column =
    withAggregateFunction { new Product(e.expr) }

用法

========== df.select(product(col("age"))) ==========
 ------------ 
|product(age)|
 ------------ 
|  5.757696E8|
 ------------

skewness

  /**
   * 聚合函数：返回组中值的偏度。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def skewness(e: Column): Column = withAggregateFunction { Skewness(e.expr) }

  /**
   * 聚合函数：返回组中值的偏度。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def skewness(columnName: String): Column = skewness(Column(columnName))

偏度（skewness），是统计数据分布偏斜方向和程度的度量，是统计数据分布非对称程度的数字特征。偏度(Skewness)亦称偏态、偏态系数。

用法

========== df.select(skewness(col("age"))) ==========
 ------------------- 
|      skewness(age)|
 ------------------- 
|0.07062157148286327|
 ------------------- 

========== df.select(skewness("age")) ==========
 ------------------- 
|      skewness(age)|
 ------------------- 
|0.07062157148286327|
 -------------------

stddev/stddev_samp/stddev_pop

  /**
   * 聚合函数： `stddev_samp` 的别名
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def stddev(e: Column): Column = withAggregateFunction { StddevSamp(e.expr) }

  /**
   * 聚合函数： `stddev_samp` 的别名
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def stddev(columnName: String): Column = stddev(Column(columnName))

  /**
   * 聚合函数：返回组中表达式的样本标准差。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def stddev_samp(e: Column): Column = withAggregateFunction { StddevSamp(e.expr) }

  /**
   * 聚合函数：返回组中表达式的样本标准差。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def stddev_samp(columnName: String): Column = stddev_samp(Column(columnName))

  /**
   * 聚合函数：返回组中表达式的总体标准差。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def stddev_pop(e: Column): Column = withAggregateFunction { StddevPop(e.expr) }

  /**
   * 聚合函数：返回组中表达式的总体标准差。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def stddev_pop(columnName: String): Column = stddev_pop(Column(columnName))

用法

========== df.select(stddev(col("age"))) ==========
 ----------------- 
| stddev_samp(age)|
 ----------------- 
|4.750438576243952|
 ----------------- 

========== df.select(stddev("age")) ==========
 ----------------- 
| stddev_samp(age)|
 ----------------- 
|4.750438576243952|
 ----------------- 

========== df.select(stddev_samp(col("age"))) ==========
 ----------------- 
| stddev_samp(age)|
 ----------------- 
|4.750438576243952|
 ----------------- 

========== df.select(stddev_samp("age")) ==========
 ----------------- 
| stddev_samp(age)|
 ----------------- 
|4.750438576243952|
 ----------------- 

========== df.select(stddev_pop(col("age"))) ==========
 ----------------- 
|  stddev_pop(age)|
 ----------------- 
|4.336537277085896|
 ----------------- 

========== df.select(stddev_pop("age")) ==========
 ----------------- 
|  stddev_pop(age)|
 ----------------- 
|4.336537277085896|
 -----------------

sum/sum_distinct

  /**
   * 聚合函数：返回表达式中所有值的总和。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def sum(e: Column): Column = withAggregateFunction { Sum(e.expr) }

  /**
   * 聚合函数：返回表达式中所有值的总和。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  def sum(columnName: String): Column = sum(Column(columnName))

  /**
   * 聚合函数：返回表达式中不同值的总和。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  @deprecated("Use sum_distinct", "3.2.0")
  def sumDistinct(e: Column): Column = withAggregateFunction(Sum(e.expr), isDistinct = true)

  /**
   * 聚合函数：返回表达式中不同值的总和。
   *
   * @group agg_funcs
   * @since 1.3.0
   */
  @deprecated("Use sum_distinct", "3.2.0")
  def sumDistinct(columnName: String): Column = sum_distinct(Column(columnName))

  /**
   * 聚合函数：返回表达式中不同值的总和。
   *
   * @group agg_funcs
   * @since 3.2.0
   */
  def sum_distinct(e: Column): Column = withAggregateFunction(Sum(e.expr), isDistinct = true)

用法

========== df.select(sum(col("age"))) ==========
 -------- 
|sum(age)|
 -------- 
|   175.0|
 -------- 

========== df.select(sum("age")) ==========
 -------- 
|sum(age)|
 -------- 
|   175.0|
 -------- 

========== df.select(sumDistinct(col("age"))) ==========
 ----------------- 
|sum(DISTINCT age)|
 ----------------- 
|            151.0|
 ----------------- 

========== df.select(sumDistinct("age")) ==========
 ----------------- 
|sum(DISTINCT age)|
 ----------------- 
|            151.0|
 ----------------- 

========== df.select(sum_distinct(col("age"))) ==========
 ----------------- 
|sum(DISTINCT age)|
 ----------------- 
|            151.0|
 -----------------

variance/var_samp/var_pop

  /**
   * 聚合函数： `var_samp` 的别名
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def variance(e: Column): Column = withAggregateFunction { VarianceSamp(e.expr) }

  /**
   * 聚合函数： `var_samp` 的别名
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def variance(columnName: String): Column = variance(Column(columnName))

  /**
   * 聚合函数：返回组中值的无偏方差。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def var_samp(e: Column): Column = withAggregateFunction { VarianceSamp(e.expr) }

  /**
   * 聚合函数：返回组中值的无偏方差。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def var_samp(columnName: String): Column = var_samp(Column(columnName))

  /**
   * 聚合函数：返回组中值的总体方差。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def var_pop(e: Column): Column = withAggregateFunction { VariancePop(e.expr) }

  /**
   * 聚合函数：返回组中值的总体方差。
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def var_pop(columnName: String): Column = var_pop(Column(columnName))

用法

========== df.select(variance(col("age"))) ==========
 ----------------- 
|    var_samp(age)|
 ----------------- 
|22.56666666666667|
 ----------------- 

========== df.select(variance("age")) ==========
 ----------------- 
|    var_samp(age)|
 ----------------- 
|22.56666666666667|
 ----------------- 

========== df.select(var_samp(col("age"))) ==========
 ----------------- 
|    var_samp(age)|
 ----------------- 
|22.56666666666667|
 ----------------- 

========== df.select(var_samp("age")) ==========
 ----------------- 
|    var_samp(age)|
 ----------------- 
|22.56666666666667|
 ----------------- 

========== df.select(var_pop(col("age"))) ==========
 ------------------ 
|      var_pop(age)|
 ------------------ 
|18.805555555555557|
 ------------------ 

========== df.select(var_pop("age")) ==========
 ------------------ 
|      var_pop(age)|
 ------------------ 
|18.805555555555557|
 ------------------

实践

数据

employees.csv

Alan,PD,30,8000
Bob,HR,28,7000
Bruce,PD,24,5000
Charles,ED,34,9000
David,,35,10000
Joan,PD,,
Tony,HR,24,6000

代码

package com.shockang.study.spark.sql.functions

import com.shockang.study.spark.util.Utils.formatPrint
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

/**
 * Aggregate functions
 *
 * @author Shockang
 */
object AggregateFunctionsExample {
  val DATA_PATH = "/Users/shockang/code/spark-examples/data/simple/read/employees.csv"

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val spark = SparkSession.builder().appName("AggregateFunctionsExample").master("local[*]").getOrCreate()
    val df = spark.read.csv(DATA_PATH).toDF("name", "dept", "age", "salary").cache()

    // approx_count_distinct
    formatPrint("""df.select(approxCountDistinct(col("age")))""")
    df.select(approxCountDistinct(col("age"))).show()

    formatPrint("""df.select(approxCountDistinct("age"))""")
    df.select(approxCountDistinct("age")).show()

    formatPrint("""df.select(approxCountDistinct(col("age"), 0.05))""")
    df.select(approxCountDistinct(col("age"), 0.05)).show()

    formatPrint("""df.select(approxCountDistinct("age", 0.05))""")
    df.select(approxCountDistinct("age", 0.05)).show()

    formatPrint("""df.select(approx_count_distinct(col("age")))""")
    df.select(approx_count_distinct(col("age"))).show()

    formatPrint("""df.select(approx_count_distinct("age"))""")
    df.select(approx_count_distinct("age")).show()

    formatPrint("""df.select(approx_count_distinct(col("age"), 0.05))""")
    df.select(approx_count_distinct(col("age"), 0.05)).show()

    formatPrint("""df.select(approx_count_distinct("age", 0.05))""")
    df.select(approx_count_distinct("age", 0.05)).show()

    // avg
    formatPrint("""df.select(avg(col("age")))""")
    df.select(avg(col("age"))).show()
    formatPrint("""df.select(avg("age"))""")
    df.select(avg("age")).show()

    // collect_list/collect_set
    formatPrint("""df.select(collect_list(col("age")))""")
    df.select(collect_list(col("age"))).show()

    formatPrint("""df.select(collect_list("age"))""")
    df.select(collect_list("age")).show()

    // collect_list/collect_set
    formatPrint("""df.select(collect_set(col("age")))""")
    df.select(collect_set(col("age"))).show()

    formatPrint("""df.select(collect_set("age"))""")
    df.select(collect_set("age")).show()

    // corr
    formatPrint("""df.select(corr(col("age"), col("salary")))""")
    df.select(corr(col("age"), col("salary"))).show()

    formatPrint("""df.select(corr("age", "salary"))""")
    df.select(corr("age", "salary")).show()

    // count/count_distinct
    formatPrint("""df.select(count(col("age")))""")
    df.select(count(col("age"))).show()

    formatPrint("""df.select(count("age"))""")
    df.select(count("age")).show()

    formatPrint("""df.select(countDistinct(col("age")))""")
    df.select(countDistinct(col("age"))).show()

    formatPrint("""df.select(countDistinct("age"))""")
    df.select(countDistinct("age")).show()

    formatPrint("""df.select(count_distinct(col("age")))""")
    df.select(count_distinct(col("age"))).show()

    formatPrint("""df.select(count_distinct(col("age"), col("salary")))""")
    df.select(count_distinct(col("age"), col("salary"))).show()

    // covar_pop/covar_samp
    formatPrint("""df.select(covar_pop(col("age"), col("salary")))""")
    df.select(covar_pop(col("age"), col("salary"))).show()

    formatPrint("""df.select(covar_pop("age", "salary"))""")
    df.select(covar_pop("age", "salary")).show()

    formatPrint("""df.select(covar_samp(col("age"), col("salary")))""")
    df.select(covar_samp(col("age"), col("salary"))).show()

    formatPrint("""df.select(covar_samp("age", "salary"))""")
    df.select(covar_samp("age", "salary")).show()

    // first
    formatPrint("""df.select(first(col("age"), ignoreNulls = true))""")
    df.select(first(col("age"), ignoreNulls = true)).show()

    formatPrint("""df.select(first("age", ignoreNulls = true))""")
    df.select(first("age", ignoreNulls = true)).show()

    formatPrint("""df.select(first(col("age")))""")
    df.select(first(col("age"))).show()

    formatPrint("""df.select(first("age"))""")
    df.select(first("age")).show()

    // grouping/grouping_id
    formatPrint("""df.cube("age", "salary").agg(grouping(col("age")), grouping(col("salary")), grouping_id(col("age"), col("salary")))""")
    df.cube("age", "salary").agg(grouping(col("age")), grouping(col("salary")), grouping_id(col("age"), col("salary"))).show()

    formatPrint("""df.cube("age", "salary").agg(grouping("age"), grouping("salary"), grouping_id("age", "salary"))""")
    df.cube("age", "salary").agg(grouping("age"), grouping("salary"), grouping_id("age", "salary")).show()

    // kurtosis
    formatPrint("""df.select(kurtosis(col("age")))""")
    df.select(kurtosis(col("age"))).show()

    formatPrint("""df.select(kurtosis("age"))""")
    df.select(kurtosis("age")).show()

    // last
    formatPrint("""df.select(last(col("age"), ignoreNulls = true))""")
    df.select(last(col("age"), ignoreNulls = true)).show()

    formatPrint("""df.select(last("age", ignoreNulls = true))""")
    df.select(last("age", ignoreNulls = true)).show()

    formatPrint("""df.select(last(col("age")))""")
    df.select(last(col("age"))).show()

    formatPrint("""df.select(last("age"))""")
    df.select(last("age")).show()

    // max/mean/min
    formatPrint("""df.select(max(col("age")))""")
    df.select(max(col("age"))).show()

    formatPrint("""df.select(max("age"))""")
    df.select(max("age")).show()

    formatPrint("""df.select(mean(col("age")))""")
    df.select(mean(col("age"))).show()

    formatPrint("""df.select(mean("age"))""")
    df.select(mean("age")).show()

    formatPrint("""df.select(min(col("age")))""")
    df.select(min(col("age"))).show()

    formatPrint("""df.select(min("age"))""")
    df.select(min("age")).show()

    // percentile_approx
    formatPrint("""df.select(percentile_approx(col("age"), lit(0.5), lit(50)))""")
    df.select(percentile_approx(col("age"), lit(0.5), lit(50))).show()

    // product
    formatPrint("""df.select(product(col("age")))""")
    df.select(product(col("age"))).show()

    // skewness
    formatPrint("""df.select(skewness(col("age")))""")
    df.select(skewness(col("age"))).show()

    formatPrint("""df.select(skewness("age"))""")
    df.select(skewness("age")).show()

    // stddev/stddev_samp/stddev_pop
    formatPrint("""df.select(stddev(col("age")))""")
    df.select(stddev(col("age"))).show()

    formatPrint("""df.select(stddev("age"))""")
    df.select(stddev("age")).show()

    formatPrint("""df.select(stddev_samp(col("age")))""")
    df.select(stddev_samp(col("age"))).show()

    formatPrint("""df.select(stddev_samp("age"))""")
    df.select(stddev_samp("age")).show()

    formatPrint("""df.select(stddev_pop(col("age")))""")
    df.select(stddev_pop(col("age"))).show()

    formatPrint("""df.select(stddev_pop("age"))""")
    df.select(stddev_pop("age")).show()

    // sum/sum_distinct
    formatPrint("""df.select(sum(col("age")))""")
    df.select(sum(col("age"))).show()

    formatPrint("""df.select(sum("age"))""")
    df.select(sum("age")).show()

    formatPrint("""df.select(sumDistinct(col("age")))""")
    df.select(sumDistinct(col("age"))).show()

    formatPrint("""df.select(sumDistinct("age"))""")
    df.select(sumDistinct("age")).show()

    formatPrint("""df.select(sum_distinct(col("age")))""")
    df.select(sum_distinct(col("age"))).show()

    // variance/var_samp/var_pop
    formatPrint("""df.select(variance(col("age")))""")
    df.select(variance(col("age"))).show()

    formatPrint("""df.select(variance("age"))""")
    df.select(variance("age")).show()

    formatPrint("""df.select(var_samp(col("age")))""")
    df.select(var_samp(col("age"))).show()

    formatPrint("""df.select(var_samp("age"))""")
    df.select(var_samp("age")).show()

    formatPrint("""df.select(var_pop(col("age")))""")
    df.select(var_pop(col("age"))).show()

    formatPrint("""df.select(var_pop("age"))""")
    df.select(var_pop("age")).show()
  }
}

输出

========== df.select(approxCountDistinct(col("age"))) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approxCountDistinct("age")) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approxCountDistinct(col("age"), 0.05)) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approxCountDistinct("age", 0.05)) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approx_count_distinct(col("age"))) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approx_count_distinct("age")) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approx_count_distinct(col("age"), 0.05)) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(approx_count_distinct("age", 0.05)) ==========
 -------------------------- 
|approx_count_distinct(age)|
 -------------------------- 
|                         5|
 -------------------------- 

========== df.select(avg(col("age"))) ==========
 ------------------ 
|          avg(age)|
 ------------------ 
|29.166666666666668|
 ------------------ 

========== df.select(avg("age")) ==========
 ------------------ 
|          avg(age)|
 ------------------ 
|29.166666666666668|
 ------------------ 

========== df.select(collect_list(col("age"))) ==========
 -------------------- 
|   collect_list(age)|
 -------------------- 
|[30, 28, 24, 34, ...|
 -------------------- 

========== df.select(collect_list("age")) ==========
 -------------------- 
|   collect_list(age)|
 -------------------- 
|[30, 28, 24, 34, ...|
 -------------------- 

========== df.select(collect_set(col("age"))) ==========
 -------------------- 
|    collect_set(age)|
 -------------------- 
|[30, 24, 28, 34, 35]|
 -------------------- 

========== df.select(collect_set("age")) ==========
 -------------------- 
|    collect_set(age)|
 -------------------- 
|[30, 24, 28, 34, 35]|
 -------------------- 

========== df.select(corr(col("age"), col("salary"))) ==========
 ------------------ 
| corr(age, salary)|
 ------------------ 
|0.9789297418835571|
 ------------------ 

========== df.select(corr("age", "salary")) ==========
 ------------------ 
| corr(age, salary)|
 ------------------ 
|0.9789297418835571|
 ------------------ 

========== df.select(count(col("age"))) ==========
 ---------- 
|count(age)|
 ---------- 
|         6|
 ---------- 

========== df.select(count("age")) ==========
 ---------- 
|count(age)|
 ---------- 
|         6|
 ---------- 

========== df.select(countDistinct(col("age"))) ==========
 ------------------- 
|count(DISTINCT age)|
 ------------------- 
|                  5|
 ------------------- 

========== df.select(countDistinct("age")) ==========
 ------------------- 
|count(DISTINCT age)|
 ------------------- 
|                  5|
 ------------------- 

========== df.select(count_distinct(col("age"))) ==========
 ------------------- 
|count(DISTINCT age)|
 ------------------- 
|                  5|
 ------------------- 

========== df.select(count_distinct(col("age"), col("salary"))) ==========
 --------------------------- 
|count(DISTINCT age, salary)|
 --------------------------- 
|                          6|
 --------------------------- 

========== df.select(covar_pop(col("age"), col("salary"))) ==========
 ---------------------- 
|covar_pop(age, salary)|
 ---------------------- 
|                7250.0|
 ---------------------- 

========== df.select(covar_pop("age", "salary")) ==========
 ---------------------- 
|covar_pop(age, salary)|
 ---------------------- 
|                7250.0|
 ---------------------- 

========== df.select(covar_samp(col("age"), col("salary"))) ==========
 ----------------------- 
|covar_samp(age, salary)|
 ----------------------- 
|                 8700.0|
 ----------------------- 

========== df.select(covar_samp("age", "salary")) ==========
 ----------------------- 
|covar_samp(age, salary)|
 ----------------------- 
|                 8700.0|
 ----------------------- 

========== df.select(first(col("age"), ignoreNulls = true)) ==========
 ---------- 
|first(age)|
 ---------- 
|        30|
 ---------- 

========== df.select(first("age", ignoreNulls = true)) ==========
 ---------- 
|first(age)|
 ---------- 
|        30|
 ---------- 

========== df.select(first(col("age"))) ==========
 ---------- 
|first(age)|
 ---------- 
|        30|
 ---------- 

========== df.select(first("age")) ==========
 ---------- 
|first(age)|
 ---------- 
|        30|
 ---------- 

========== df.cube("age", "salary").agg(grouping(col("age")), grouping(col("salary")), grouping_id(col("age"), col("salary"))) ==========
 ---- ------ ------------- ---------------- ------------------------ 
| age|salary|grouping(age)|grouping(salary)|grouping_id(age, salary)|
 ---- ------ ------------- ---------------- ------------------------ 
|null|  null|            0|               0|                       0|
|  24|  5000|            0|               0|                       0|
|  28|  7000|            0|               0|                       0|
|null|  5000|            1|               0|                       2|
|null|  7000|            1|               0|                       2|
|  24|  6000|            0|               0|                       0|
|null| 10000|            1|               0|                       2|
|null|  6000|            1|               0|                       2|
|  30|  null|            0|               1|                       1|
|null|  9000|            1|               0|                       2|
|  30|  8000|            0|               0|                       0|
|null|  null|            0|               1|                       1|
|  34|  null|            0|               1|                       1|
|  34|  9000|            0|               0|                       0|
|null|  null|            1|               1|                       3|
|  35|  null|            0|               1|                       1|
|null|  null|            1|               0|                       2|
|  24|  null|            0|               1|                       1|
|  28|  null|            0|               1|                       1|
|null|  8000|            1|               0|                       2|
 ---- ------ ------------- ---------------- ------------------------ 
only showing top 20 rows

========== df.cube("age", "salary").agg(grouping("age"), grouping("salary"), grouping_id("age", "salary")) ==========
 ---- ------ ------------- ---------------- ------------------------ 
| age|salary|grouping(age)|grouping(salary)|grouping_id(age, salary)|
 ---- ------ ------------- ---------------- ------------------------ 
|null|  null|            0|               0|                       0|
|  24|  5000|            0|               0|                       0|
|  28|  7000|            0|               0|                       0|
|null|  5000|            1|               0|                       2|
|null|  7000|            1|               0|                       2|
|  24|  6000|            0|               0|                       0|
|null| 10000|            1|               0|                       2|
|null|  6000|            1|               0|                       2|
|  30|  null|            0|               1|                       1|
|null|  9000|            1|               0|                       2|
|  30|  8000|            0|               0|                       0|
|null|  null|            0|               1|                       1|
|  34|  null|            0|               1|                       1|
|  34|  9000|            0|               0|                       0|
|null|  null|            1|               1|                       3|
|  35|  null|            0|               1|                       1|
|null|  null|            1|               0|                       2|
|  24|  null|            0|               1|                       1|
|  28|  null|            0|               1|                       1|
|null|  8000|            1|               0|                       2|
 ---- ------ ------------- ---------------- ------------------------ 
only showing top 20 rows

========== df.select(kurtosis(col("age"))) ==========
 ------------------- 
|      kurtosis(age)|
 ------------------- 
|-1.5243591393954996|
 ------------------- 

========== df.select(kurtosis("age")) ==========
 ------------------- 
|      kurtosis(age)|
 ------------------- 
|-1.5243591393954996|
 ------------------- 

========== df.select(last(col("age"), ignoreNulls = true)) ==========
 --------- 
|last(age)|
 --------- 
|       24|
 --------- 

========== df.select(last("age", ignoreNulls = true)) ==========
 --------- 
|last(age)|
 --------- 
|       24|
 --------- 

========== df.select(last(col("age"))) ==========
 --------- 
|last(age)|
 --------- 
|       24|
 --------- 

========== df.select(last("age")) ==========
 --------- 
|last(age)|
 --------- 
|       24|
 --------- 

========== df.select(max(col("age"))) ==========
 -------- 
|max(age)|
 -------- 
|      35|
 -------- 

========== df.select(max("age")) ==========
 -------- 
|max(age)|
 -------- 
|      35|
 -------- 

========== df.select(mean(col("age"))) ==========
 ------------------ 
|          avg(age)|
 ------------------ 
|29.166666666666668|
 ------------------ 

========== df.select(mean("age")) ==========
 ------------------ 
|          avg(age)|
 ------------------ 
|29.166666666666668|
 ------------------ 

========== df.select(min(col("age"))) ==========
 -------- 
|min(age)|
 -------- 
|      24|
 -------- 

========== df.select(min("age")) ==========
 -------- 
|min(age)|
 -------- 
|      24|
 -------- 

========== df.select(percentile_approx(col("age"), lit(0.5), lit(50))) ==========
 ------------------------------- 
|percentile_approx(age, 0.5, 50)|
 ------------------------------- 
|                           28.0|
 ------------------------------- 

========== df.select(product(col("age"))) ==========
 ------------ 
|product(age)|
 ------------ 
|  5.757696E8|
 ------------ 

========== df.select(skewness(col("age"))) ==========
 ------------------- 
|      skewness(age)|
 ------------------- 
|0.07062157148286327|
 ------------------- 

========== df.select(skewness("age")) ==========
 ------------------- 
|      skewness(age)|
 ------------------- 
|0.07062157148286327|
 ------------------- 

========== df.select(stddev(col("age"))) ==========
 ----------------- 
| stddev_samp(age)|
 ----------------- 
|4.750438576243952|
 ----------------- 

========== df.select(stddev("age")) ==========
 ----------------- 
| stddev_samp(age)|
 ----------------- 
|4.750438576243952|
 ----------------- 

========== df.select(stddev_samp(col("age"))) ==========
 ----------------- 
| stddev_samp(age)|
 ----------------- 
|4.750438576243952|
 ----------------- 

========== df.select(stddev_samp("age")) ==========
 ----------------- 
| stddev_samp(age)|
 ----------------- 
|4.750438576243952|
 ----------------- 

========== df.select(stddev_pop(col("age"))) ==========
 ----------------- 
|  stddev_pop(age)|
 ----------------- 
|4.336537277085896|
 ----------------- 

========== df.select(stddev_pop("age")) ==========
 ----------------- 
|  stddev_pop(age)|
 ----------------- 
|4.336537277085896|
 ----------------- 

========== df.select(sum(col("age"))) ==========
 -------- 
|sum(age)|
 -------- 
|   175.0|
 -------- 

========== df.select(sum("age")) ==========
 -------- 
|sum(age)|
 -------- 
|   175.0|
 -------- 

========== df.select(sumDistinct(col("age"))) ==========
 ----------------- 
|sum(DISTINCT age)|
 ----------------- 
|            151.0|
 ----------------- 

========== df.select(sumDistinct("age")) ==========
 ----------------- 
|sum(DISTINCT age)|
 ----------------- 
|            151.0|
 ----------------- 

========== df.select(sum_distinct(col("age"))) ==========
 ----------------- 
|sum(DISTINCT age)|
 ----------------- 
|            151.0|
 ----------------- 

========== df.select(variance(col("age"))) ==========
 ----------------- 
|    var_samp(age)|
 ----------------- 
|22.56666666666667|
 ----------------- 

========== df.select(variance("age")) ==========
 ----------------- 
|    var_samp(age)|
 ----------------- 
|22.56666666666667|
 ----------------- 

========== df.select(var_samp(col("age"))) ==========
 ----------------- 
|    var_samp(age)|
 ----------------- 
|22.56666666666667|
 ----------------- 

========== df.select(var_samp("age")) ==========
 ----------------- 
|    var_samp(age)|
 ----------------- 
|22.56666666666667|
 ----------------- 

========== df.select(var_pop(col("age"))) ==========
 ------------------ 
|      var_pop(age)|
 ------------------ 
|18.805555555555557|
 ------------------ 

========== df.select(var_pop("age")) ==========
 ------------------ 
|      var_pop(age)|
 ------------------ 
|18.805555555555557|
 ------------------

这篇好文章是转载于：学新通技术网

Spark SQL functions.scala 源码二Aggregate functions Spark 3.3.0

前言

目录

关联

正文

approx_count_distinct

用法

avg

用法

collect_list/collect_set

用法

corr

用法

count/count_distinct

用法

covar_pop/covar_samp

用法

first

用法

grouping/grouping_id

用法

kurtosis

用法

last

用法

max/max_by/mean/min/min_by

用法

percentile_approx

用法

product

用法

skewness

用法

stddev/stddev_samp/stddev_pop

用法

sum/sum_distinct

用法

variance/var_samp/var_pop

用法

实践

数据

employees.csv

代码

输出

photoshop保存的图片太大微信发不了怎么办

《学习通》视频自动暂停处理方法

word里面弄一个表格后上面的标题会跑到下面怎么办

Android 11 保存文件到外部存储，并分享文件

photoshop扩展功能面板显示灰色怎么办

微信公众号没有声音提示怎么办

excel下划线不显示怎么办

excel打印预览压线压字怎么办

TikTok加速器哪个好免费的TK加速器推荐

怎样阻止微信小程序自动打开