In Spark SQL 2.3.0, joining a table with itself after a grouping sets aggregation can fail because column names cannot be resolved. The error looks like:

org.apache.spark.sql.AnalysisException: cannot resolve '`a.a`' given input columns: [b.b, b.c, a.a, a.b, a.c]; line 8 pos 29;

or

org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:

Reproducing the problem

First, create some test data:

spark.sql("set spark.sql.crossJoin.enabled=true") // needed later for the join without an ON condition

spark.sql("""
cache table t1
SELECT inline(array(struct(
                    date_add('2021-08-04', -0),
                    date_add('2021-08-04', -7),
                    date_add('2021-08-04', -30)
                ),
                struct(
                    date_add('2021-08-04', -1),
                    date_add('2021-08-04', -8),
                    date_add('2021-08-04', -31)
                ),
                struct(
                    date_add('2021-08-04', -2),
                    date_add('2021-08-04', -9),
                    date_add('2021-08-04', -32)
                ),
                struct(
                    date_add('2021-08-04', -3),
                    date_add('2021-08-04', -10),
                    date_add('2021-08-04', -33)
                ),
                struct(
                    date_add('2021-08-04', -4),
                    date_add('2021-08-04', -11),
                    date_add('2021-08-04', -34)
                ),
                struct(
                    date_add('2021-08-04', -5),
                    date_add('2021-08-04', -12),
                    date_add('2021-08-04', -35)
                ))) as (a, b, c)""").show(20, false);

Then apply grouping sets and join the result with itself:

spark.sql("""
with t2 as ( 
select a, count(b) as b, count(distinct c) as c
from t1
group by a grouping sets ((a),())
)
select * 
from t2 as a join t2 as b on a.a=b.a
""").show()

This fails with:

org.apache.spark.sql.AnalysisException: cannot resolve '`a.a`' given input columns: [a.b, b.b, b.c, a.a, a.c]; line 8 pos 29;

When the join has no ON condition, the second error appears instead:

spark.sql("""
with t2 as ( 
select a, count(b) as b, count(distinct c) as c
from t1
group by a grouping sets ((a),())
)
select * 
from t2 as a join t2 as b
""").show()

This fails with:

org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:

Cause

A guess at the root cause: when grouping sets materializes its result through an Expand node, it does not generate fresh attribute names (expression IDs) for the output, so the two branches of the self-join end up carrying conflicting references to the same attributes. For details see
https://github.com/apache/spark/pull/26441
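You can inspect this yourself in the spark-shell by comparing the analyzed plans of the two join branches. A minimal sketch (the variable names are illustrative; on an affected 2.3.0 build the join itself may already throw the same AnalysisException, in which case printing the plan of `agg` alone still shows the attribute IDs produced by the Expand node):

```scala
// Build the grouping-sets aggregation once; t1 is the cached table from above.
val agg = spark.sql("""
  select a, count(b) as b, count(distinct c) as c
  from t1
  group by a grouping sets ((a), ())
""")

// The #N suffixes in the analyzed plan are the expression IDs. If both join
// branches show the same IDs for a, b and c, the references conflict.
agg.as("x").join(agg.as("y")).explain(true)
```

In fixed versions, the analyzer deduplicates the attribute IDs on one side of the self-join, so the two branches no longer collide.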

Workaround

This problem is fixed in Spark 2.4.5 and 3.0.0.
On 2.3.0 you can also work around it by explicitly aliasing every selected column:

spark.sql("""
with t2 as ( 
select a as a, count(b) as b, count(distinct c) as c
from t1
group by a grouping sets ((a),())
)
select * 
from t2 as a join t2 as b on a.a=b.a
""").show()

Note the `as a` added after `a`. With every column aliased, the query runs correctly:

+----------+---+---+----------+---+---+
|         a|  b|  c|         a|  b|  c|
+----------+---+---+----------+---+---+
|2021-07-30|  1|  1|2021-07-30|  1|  1|
|2021-08-01|  1|  1|2021-08-01|  1|  1|
|2021-08-03|  1|  1|2021-08-03|  1|  1|
|2021-07-31|  1|  1|2021-07-31|  1|  1|
|2021-08-02|  1|  1|2021-08-02|  1|  1|
|2021-08-04|  1|  1|2021-08-04|  1|  1|
+----------+---+---+----------+---+---+