Spark SQL org.apache.spark.sql.AnalysisException: cannot resolve '`a.a`' given input columns
In Spark SQL 2.3.0, self-joining the result of a grouping sets query fails because column names cannot be resolved. The query dies with either
org.apache.spark.sql.AnalysisException: cannot resolve '`a.a`' given input columns: [b.b, b.c, a.a, a.b, a.c]; line 8 pos 29;
or
org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:
Reproducing the problem
First, generate some test data:
spark.sql("set spark.sql.crossJoin.enabled=true")
spark.sql("""
cache table t1
SELECT inline(array(struct(
date_add('2021-08-04', -0),
date_add('2021-08-04', -7),
date_add('2021-08-04', -30)
),
struct(
date_add('2021-08-04', -1),
date_add('2021-08-04', -8),
date_add('2021-08-04', -31)
),
struct(
date_add('2021-08-04', -2),
date_add('2021-08-04', -9),
date_add('2021-08-04', -32)
),
struct(
date_add('2021-08-04', -3),
date_add('2021-08-04', -10),
date_add('2021-08-04', -33)
),
struct(
date_add('2021-08-04', -4),
date_add('2021-08-04', -11),
date_add('2021-08-04', -34)
),
struct(
date_add('2021-08-04', -5),
date_add('2021-08-04', -12),
date_add('2021-08-04', -35)
))) as (a, b,c)""").show(20, false);
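To sanity-check the cached table, a quick sketch (date_add returns dates, so all three columns should come back as date type):
spark.table("t1").printSchema()  // expect columns a, b, c, each of date type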
Then apply grouping sets and join the result with itself:
spark.sql("""
with t2 as (
select a, count(b) as b, count(distinct c) as c
from t1
group by a grouping sets ((a),())
)
select *
from t2 as a join t2 as b on a.a=b.a
""").show()
This fails with:
org.apache.spark.sql.AnalysisException: cannot resolve '`a.a`' given input columns: [a.b, b.b, b.c, a.a, a.c]; line 8 pos 29;
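Since Dataset analysis is eager in Spark, the exception is thrown by spark.sql itself; here is a small sketch that captures just the message (assuming the same t1 as above):
try {
  spark.sql("""
    with t2 as (
      select a, count(b) as b, count(distinct c) as c
      from t1
      group by a grouping sets ((a),())
    )
    select * from t2 as a join t2 as b on a.a=b.a""")
} catch {
  // Analysis runs when the Dataset is created, so the failure surfaces here.
  case e: org.apache.spark.sql.AnalysisException => println(e.getMessage)
}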
When the join has no condition, the second error appears instead:
spark.sql("""
with t2 as (
select a, count(b) as b, count(distinct c) as c
from t1
group by a grouping sets ((a),())
)
select *
from t2 as a join t2 as b
""").show()
This fails with:
org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:
Cause
A guess at the cause: the Expand operator that implements grouping sets does not generate fresh attribute IDs for its output, so both sides of the self-join end up referencing the very same attributes, which the analyzer cannot deduplicate.
For details, see
https://github.com/apache/spark/pull/26441
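The shared attribute IDs can be seen directly in the analyzed plan (a sketch; queryExecution and exprId are internal but accessible from spark-shell):
val grouped = spark.sql("""
  select a, count(b) as b, count(distinct c) as c
  from t1
  group by a grouping sets ((a),())""")
// Print each output attribute together with its exprId; when the CTE is
// inlined on both sides of the join, these exact IDs appear twice.
grouped.queryExecution.analyzed.output.foreach(attr => println(s"${attr.name}#${attr.exprId.id}"))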
Solution
The issue was fixed in Spark 2.4.5 and 3.0.0.
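If upgrading is an option, check the running version first:
println(spark.version)  // the fix shipped with 2.4.5 and 3.0.0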
Alternatively, you can work around it by explicitly aliasing every output column:
spark.sql("""
with t2 as (
select a as a, count(b) as b, count(distinct c) as c
from t1
group by a grouping sets ((a),())
)
select *
from t2 as a join t2 as b on a.a=b.a
""").show()
Note that a is now followed by "as a". With every output column explicitly aliased, the query runs correctly:
+----------+---+---+----------+---+---+
| a| b| c| a| b| c|
+----------+---+---+----------+---+---+
|2021-07-30| 1| 1|2021-07-30| 1| 1|
|2021-08-01| 1| 1|2021-08-01| 1| 1|
|2021-08-03| 1| 1|2021-08-03| 1| 1|
|2021-07-31| 1| 1|2021-07-31| 1| 1|
|2021-08-02| 1| 1|2021-08-02| 1| 1|
|2021-08-04| 1| 1|2021-08-04| 1| 1|
+----------+---+---+----------+---+---+
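The same aliasing trick should also fix the condition-less join from above, since every output column of t2 now gets a fresh alias (a sketch; crossJoin.enabled was already set at the start):
spark.sql("""
with t2 as (
select a as a, count(b) as b, count(distinct c) as c
from t1
group by a grouping sets ((a),())
)
select *
from t2 as a join t2 as b
""").show()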