[Spark Java API] Action (5): treeAggregate, treeReduce


龚禧学长 · Published 2021-02-27 00:04:53

treeAggregate

Official documentation: Aggregates the elements of this RDD in a multi-level tree pattern.

Function prototypes:

def treeAggregate[U](
    zeroValue: U,
    seqOp: JFunction2[U, T, U],
    combOp: JFunction2[U, U, U],
    depth: Int): U

def treeAggregate[U](
    zeroValue: U,
    seqOp: JFunction2[U, T, U],
    combOp: JFunction2[U, U, U]): U

It can be understood as a more elaborate, multi-level version of aggregate.

Source code analysis:

def treeAggregate[U: ClassTag](zeroValue: U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U = withScope {
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
  if (partitions.length == 0) {
    Utils.clone(zeroValue, context.env.closureSerializer.newInstance())
  } else {
    val cleanSeqOp = context.clean(seqOp)
    val cleanCombOp = context.clean(combOp)
    val aggregatePartition =
      (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
    var partiallyAggregated = mapPartitions(it => Iterator(aggregatePartition(it)))
    var numPartitions = partiallyAggregated.partitions.length
    val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
    // If creating an extra level doesn't help reduce
    // the wall-clock time, we stop tree aggregation.
    // Don't trigger TreeAggregation when it doesn't save wall-clock time
    while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {
      numPartitions /= scale
      val curNumPartitions = numPartitions
      partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex {
        (i, iter) => iter.map((i % curNumPartitions, _))
      }.reduceByKey(new HashPartitioner(curNumPartitions), cleanCombOp).values
    }
    partiallyAggregated.reduce(cleanCombOp)
  }
}

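As the source shows, treeAggregate first aggregates each partition locally with Scala's aggregate function. It also computes scale from the depth parameter. While the number of partitions is still too large, it assigns each partial result the key i % curNumPartitions and merges the results by key with reduceByKey, which repartitions them into fewer partitions. Finally, it runs a reduce over the remaining partial results. Adjusting depth therefore lowers the cost of the final reduce.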
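To make the role of scale and of the while loop concrete, the following is a minimal plain-Java sketch that replays only the partition-count arithmetic from the source above, without Spark. The class name TreeAggregateLevels and the method levels are hypothetical names chosen for this illustration; they are not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.List;

public class TreeAggregateLevels {

    /**
     * Replays the loop from treeAggregate and returns the partition count
     * after each intermediate reduceByKey level. An empty list means no
     * intermediate level is created and only the final reduce runs.
     */
    public static List<Integer> levels(int numPartitions, int depth) {
        // Same formula as in the Scala source: at least 2, otherwise ceil(n^(1/depth))
        int scale = Math.max((int) Math.ceil(Math.pow(numPartitions, 1.0 / depth)), 2);
        List<Integer> result = new ArrayList<>();
        // Add a level only while it actually shrinks the work of the final reduce
        while (numPartitions > scale + Math.ceil((double) numPartitions / scale)) {
            numPartitions /= scale; // integer division, as in the source
            result.add(numPartitions);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(levels(3, 2));   // prints []
        System.out.println(levels(50, 2));  // prints [6]
        System.out.println(levels(500, 3)); // prints [62, 7]
    }
}
```

With only 3 partitions, as in the example that follows, the loop body never runs, so the call behaves like a per-partition aggregate followed by a single reduce.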

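Example:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

// javaSparkContext is assumed to be an existing JavaSparkContext
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);

// Transformation: convert each element to a String
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
    @Override
    public String call(Integer v1) throws Exception {
        return Integer.toString(v1);
    }
});

// Action: seqOp folds elements within a partition, combOp merges partition results
String result1 = javaRDD1.treeAggregate("0", new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        System.out.println(v1 + "=seq=" + v2);
        return v1 + "=seq=" + v2;
    }
}, new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        System.out.println(v1 + "<=comb=>" + v2);
        return v1 + "<=comb=>" + v2;
    }
});

System.out.println(result1);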
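To see what the two operators contribute to the result, here is a rough plain-Java simulation of the example that needs no Spark cluster. It rests on two assumptions that are not stated in the article: that the 7 elements are split across the 3 partitions as [5, 1], [1, 4], [4, 2, 2], and that partial results are merged in partition order (Spark may merge them in whatever order tasks finish, so the real output order can differ). The class TreeAggregateLocal and its methods are hypothetical names for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

public class TreeAggregateLocal {

    public static <T, U> U aggregate(List<List<T>> partitions, U zeroValue,
                                     BiFunction<U, T, U> seqOp, BinaryOperator<U> combOp) {
        if (partitions.isEmpty()) {
            return zeroValue; // mirrors the partitions.length == 0 branch
        }
        // Phase 1: fold every partition, each starting from zeroValue
        List<U> partials = new ArrayList<>();
        for (List<T> partition : partitions) {
            U acc = zeroValue;
            for (T element : partition) {
                acc = seqOp.apply(acc, element);
            }
            partials.add(acc);
        }
        // Phase 2: merge the partial results (assumption: in partition order)
        U result = partials.get(0);
        for (int i = 1; i < partials.size(); i++) {
            result = combOp.apply(result, partials.get(i));
        }
        return result;
    }

    /** Runs the article's data under the assumed partition split. */
    public static String demo() {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("5", "1"),
                Arrays.asList("1", "4"),
                Arrays.asList("4", "2", "2"));
        return aggregate(partitions, "0",
                (acc, v) -> acc + "=seq=" + v,
                (a, b) -> a + "<=comb=>" + b);
    }

    public static void main(String[] args) {
        // prints 0=seq=5=seq=1<=comb=>0=seq=1=seq=4<=comb=>0=seq=4=seq=2=seq=2
        System.out.println(demo());
    }
}
```

Note that the zero value "0" appears once per partition in the output, because every partition starts its local fold from zeroValue.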

treeReduce

Official documentation: Reduces the elements of this RDD in a multi-level tree pattern.

Function prototypes:

def treeReduce(f: JFunction2[T, T, T], depth: Int): T

def treeReduce(f: JFunction2[T, T, T]): T

It is similar to treeAggregate, except that seqOp and combOp are the same function.

Source code analysis:

def treeReduce(f: (T, T) => T, depth: Int = 2): T = withScope {
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
  val cleanF = context.clean(f)
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  val partiallyReduced = mapPartitions(it => Iterator(reducePartition(it)))
  val op: (Option[T], Option[T]) => Option[T] = (c, x) => {
    if (c.isDefined && x.isDefined) {
      Some(cleanF(c.get, x.get))
    } else if (c.isDefined) {
      c
    } else if (x.isDefined) {
      x
    } else {
      None
    }
  }
  partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth)
    .getOrElse(throw new UnsupportedOperationException("empty collection"))
}

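As the source shows, treeReduce first reduces each partition with Scala's reduceLeft function and then runs treeAggregate on the RDD of partial results, using the same function for seqOp and combOp and an empty initial value (Option.empty). In practice, treeReduce can replace reduce, mainly when a single reduce step is expensive: adjusting depth controls how much data each round of reduction handles.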
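The Option handling described above can be sketched in plain Java with java.util.Optional. This is an illustration under stated assumptions, not Spark's implementation: the class TreeReduceSketch and its methods are hypothetical names, the partitions are merged sequentially in partition order, and the demo assumes the 7 elements of the example below are split across 5 partitions as [5], [1], [1, 4], [4], [2, 2].

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;

public class TreeReduceSketch {

    /** Counterpart of reducePartition: left-to-right reduce, empty for an empty partition. */
    public static <T> Optional<T> reducePartition(List<T> partition, BinaryOperator<T> f) {
        Optional<T> acc = Optional.empty();
        for (T element : partition) {
            acc = acc.isPresent() ? Optional.of(f.apply(acc.get(), element)) : Optional.of(element);
        }
        return acc;
    }

    /** Counterpart of op: applies f only when both sides hold a value. */
    public static <T> Optional<T> merge(Optional<T> c, Optional<T> x, BinaryOperator<T> f) {
        if (c.isPresent() && x.isPresent()) {
            return Optional.of(f.apply(c.get(), x.get()));
        } else if (c.isPresent()) {
            return c;
        } else {
            return x; // either x's value or empty
        }
    }

    /** Reduces every partition, merges the partial results, and fails on an empty input. */
    public static <T> T treeReduceLocal(List<List<T>> partitions, BinaryOperator<T> f) {
        Optional<T> result = Optional.empty(); // plays the role of Option.empty[T]
        for (List<T> partition : partitions) {
            result = merge(result, reducePartition(partition, f), f);
        }
        return result.orElseThrow(() -> new UnsupportedOperationException("empty collection"));
    }

    /** Runs the article's data under the assumed partition split. */
    public static String demo() {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("5"), Arrays.asList("1"), Arrays.asList("1", "4"),
                Arrays.asList("4"), Arrays.asList("2", "2"));
        return treeReduceLocal(partitions, (a, b) -> a + "=" + b);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 5=1=1=4=4=2=2
    }
}
```

Wrapping partial results in an optional value is what lets empty partitions drop out of the merge without requiring a neutral element for f.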

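Example:

// javaSparkContext is assumed to be an existing JavaSparkContext
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 5);

// Transformation: convert each element to a String
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
    @Override
    public String call(Integer v1) throws Exception {
        return Integer.toString(v1);
    }
});

// Action: the same function reduces within partitions and merges partition results
String result = javaRDD1.treeReduce(new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        System.out.println(v1 + "=" + v2);
        return v1 + "=" + v2;
    }
});

System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + result);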

Author: 小飞_侠_kobe

Link: https://www.jianshu.com/p/27222830d21a
