Why Hive SQL and Spark SQL return different results (even wildly different row counts) for the same SQL statement
Recently, several people have run into the same puzzle: the same SQL statement returns about 1.6 million rows in Hive, but only about 1.5 million in Spark SQL (spark-shell or spark.sql()). I hit a similar problem myself a while back, so I am writing this up as a reference.
0. Conclusion: check whether the field types on the two sides of the join's on condition match
When I ran into this, the cause turned out to be mismatched field types on the two sides of the on condition's equality:
- When one side is string and the other is bigint, Hive converts both sides to double, and the result is wrong.
- When one side is string and the other is decimal, Spark converts both sides to double, and the result is wrong.
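A quick way to confirm that implicit casts are the culprit is to desc both tables and inspect the query plan: in Hive an implicit cast shows up as UDFToDouble(...) or CAST(... AS ...) around the join keys, and in Spark as cast(... as double) in the physical plan. A minimal diagnosis sketch, using the test tables built in section 1 below:
```
-- Diagnosis sketch: compare the declared key types, then check the plan
-- for implicit casts (UDFToDouble / cast(... as double)) around the join keys.
desc new.test_type_user;
desc new.test_type_user_info;

explain
select id, name
from new.test_type_user u
left join new.test_type_user_info ui
  on u.user_id_string = ui.user_info_id_bigint;
```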
1. Data preparation
user.csv
```
1,a1,1,1,1
2,a2,1111111111111111112,1111111111111111112,1111111111111111112
3,a3,1111111111111111113,1111111111111111113,1111111111111111113
4,a4,1111111111111111114,1111111111111111114,1111111111111111114
5,a5,1111111111111111115,1111111111111111115,1111111111111111115
```
Create the target database first:

create database new;

Then load user.csv into a table from spark-shell:
```
spark.read.format("csv").schema("id int, name string, user_id_bigint bigint, user_id_decimal decimal(20,0), user_id_string string").load("file:///home/work/workdir/test_spark_hive_sql_diff/user.csv").toDF.write.mode("overwrite").saveAsTable("new.test_type_user")
spark.sql("select * from new.test_type_user").show
+---+----+-------------------+-------------------+-------------------+
| id|name| user_id_bigint| user_id_decimal| user_id_string|
+---+----+-------------------+-------------------+-------------------+
| 1| a1| 1| 1| 1|
| 2| a2|1111111111111111112|1111111111111111112|1111111111111111112|
| 3| a3|1111111111111111113|1111111111111111113|1111111111111111113|
| 4| a4|1111111111111111114|1111111111111111114|1111111111111111114|
| 5| a5|1111111111111111115|1111111111111111115|1111111111111111115|
+---+----+-------------------+-------------------+-------------------+
spark.sql("desc new.test_type_user").show
+---------------+-------------+-------+
| col_name| data_type|comment|
+---------------+-------------+-------+
| id| int| null|
| name| string| null|
| user_id_bigint| bigint| null|
|user_id_decimal|decimal(20,0)| null|
| user_id_string| string| null|
+---------------+-------------+-------+
```
user-info.csv
```
1,1,1,xiaoming
1111111111111111112,1111111111111111112,1111111111111111112,k2data
1111111111111111113,1111111111111111113,1111111111111111113,xiaoming
1111111111111111114,1111111111111111114,1111111111111111114,k2data
```
```
spark.read.format("csv").schema("user_info_id_bigint bigint, user_info_id_decimal decimal(20,0), user_info_id_string string, group string").load("file:///home/work/workdir/test_spark_hive_sql_diff/user-info.csv").toDF.write.mode("overwrite").saveAsTable("new.test_type_user_info")
spark.sql("select * from new.test_type_user_info").show
+-------------------+--------------------+-------------------+--------+
|user_info_id_bigint|user_info_id_decimal|user_info_id_string|   group|
+-------------------+--------------------+-------------------+--------+
|                  1|                   1|                  1|xiaoming|
|1111111111111111112| 1111111111111111112|1111111111111111112|  k2data|
|1111111111111111113| 1111111111111111113|1111111111111111113|xiaoming|
|1111111111111111114| 1111111111111111114|1111111111111111114|  k2data|
+-------------------+--------------------+-------------------+--------+
spark.sql("desc new.test_type_user_info").show
+--------------------+-------------+-------+
| col_name| data_type|comment|
+--------------------+-------------+-------+
| user_info_id_bigint| bigint| null|
|user_info_id_decimal|decimal(20,0)| null|
| user_info_id_string| string| null|
| group| string| null|
+--------------------+-------------+-------+
```
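The two test tables above are built from spark-shell. If you prefer to prepare them from the Hive side instead, a rough equivalent for the first table is sketched below (a plain textfile layout is assumed; the path is the same local directory used in the spark-shell load, and the second table is analogous):
```
-- Assumed Hive-side alternative to the spark-shell load above.
CREATE TABLE new.test_type_user (
  id              INT,
  name            STRING,
  user_id_bigint  BIGINT,
  user_id_decimal DECIMAL(20,0),
  user_id_string  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/work/workdir/test_spark_hive_sql_diff/user.csv'
INTO TABLE new.test_type_user;
```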
2. When the field types on both sides of the on condition match, the result is, unsurprisingly, correct in both Hive and Spark
bigint = bigint: correct result
select id,name, user_id_bigint,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_bigint
# spark
spark.sql("select id,name, user_id_bigint,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_bigint").show
+---+----+-------------------+-------------------+
| id|name| user_id_bigint|user_info_id_bigint|
+---+----+-------------------+-------------------+
| 1| a1| 1| 1|
| 2| a2|1111111111111111112|1111111111111111112|
| 3| a3|1111111111111111113|1111111111111111113|
| 4| a4|1111111111111111114|1111111111111111114|
| 5| a5|1111111111111111115| null|
+---+----+-------------------+-------------------+
spark.sql("select id,name, user_id_bigint,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_bigint").explain
== Physical Plan ==
*BroadcastHashJoin [user_id_bigint#27L], [user_info_id_bigint#89L], LeftOuter, BuildRight
:- *FileScan parquet new.test_type_user[id#25,name#26,user_id_bigint#27L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,user_id_bigint:bigint>
+- *BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
+- *FileScan parquet new.test_type_user_info[user_info_id_bigint#89L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user_info], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_info_id_bigint:bigint>
# hive
select id,name, user_id_bigint,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_bigint;
+-----+-------+----------------------+----------------------+--+
| id | name | user_id_bigint | user_info_id_bigint |
+-----+-------+----------------------+----------------------+--+
| 1 | a1 | 1 | 1 |
| 2 | a2 | 1111111111111111112 | 1111111111111111112 |
| 3 | a3 | 1111111111111111113 | 1111111111111111113 |
| 4 | a4 | 1111111111111111114 | 1111111111111111114 |
| 5 | a5 | 1111111111111111115 | NULL |
+-----+-------+----------------------+----------------------+--+
explain select id,name, user_id_bigint,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_bigint;
+-------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+-------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-3 depends on stages: Stage-4 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| ui |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| ui |
| TableScan |
| alias: ui |
| Statistics: Num rows: 156 Data size: 1253 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 user_id_bigint (type: bigint) |
| 1 user_info_id_bigint (type: bigint) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: u |
| Statistics: Num rows: 12 Data size: 1365 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 user_id_bigint (type: bigint) |
| 1 user_info_id_bigint (type: bigint) |
| outputColumnNames: _col0, _col1, _col2, _col8 |
| Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: int), _col1 (type: string), _col2 (type: bigint), _col8 (type: bigint) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+-------------------------------------------------------------------------------------------------------------------+--+
Similarly, the other matching-type cases behave the same way:
decimal = decimal: correct result
string = string: correct result
select id,name, user_id_decimal,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_decimal = ui.user_info_id_decimal
select id,name, user_id_string,user_info_id_string from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_string
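The outputs of these two queries are not shown here; if you only want to confirm that both engines agree, a quick row-count check on the same test tables is enough (a sketch, both counts should be 5):
```
-- Each user row matches at most one user_info row, so the left join keeps 5 rows.
select count(*) from new.test_type_user u
left join new.test_type_user_info ui
  on u.user_id_decimal = ui.user_info_id_decimal;

select count(*) from new.test_type_user u
left join new.test_type_user_info ui
  on u.user_id_string = ui.user_info_id_string;
```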
3. When the field types on the two sides of the on condition differ, correctness depends on which type both sides are converted to
3.1 When one side of the on condition is decimal and the other is bigint
Whether the condition is written as bigint = decimal or decimal = bigint, both sides are converted to decimal, and the result is correct.
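This direction is safe because decimal(20,0) can represent all 19 digits of the ids exactly, so distinct keys stay distinct after the cast. A quick sanity check (a sketch that runs in either engine):
```
-- decimal(20,0) keeps the full 19 digits, so adjacent ids do not collide.
select cast(1111111111111111112 as decimal(20,0)) = cast(1111111111111111113 as decimal(20,0));
-- returns false, as expected
```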
1) bigint = decimal: correct
spark:
select id,name, user_id_bigint,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_decimal;
spark.sql("select id,name, user_id_bigint,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_decimal").show()
+---+----+-------------------+--------------------+
| id|name| user_id_bigint|user_info_id_decimal|
+---+----+-------------------+--------------------+
| 1| a1| 1| 1|
| 2| a2|1111111111111111112| 1111111111111111112|
| 3| a3|1111111111111111113| 1111111111111111113|
| 4| a4|1111111111111111114| 1111111111111111114|
| 5| a5|1111111111111111115| null|
+---+----+-------------------+--------------------+
spark.sql("select id,name, user_id_bigint,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_decimal").explain
== Physical Plan ==
*BroadcastHashJoin [cast(user_id_bigint#27L as decimal(20,0))], [user_info_id_decimal#90], LeftOuter, BuildRight
:- *FileScan parquet new.test_type_user[id#25,name#26,user_id_bigint#27L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,user_id_bigint:bigint>
+- *BroadcastExchange HashedRelationBroadcastMode(List(input[0, decimal(20,0), true]))
+- *FileScan parquet new.test_type_user_info[user_info_id_decimal#90] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user_info], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_info_id_decimal:decimal(20,0)>
hive:
select id,name, user_id_bigint,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_decimal;
+-----+-------+----------------------+-----------------------+--+
| id | name | user_id_bigint | user_info_id_decimal |
+-----+-------+----------------------+-----------------------+--+
| 1 | a1 | 1 | 1 |
| 2 | a2 | 1111111111111111112 | 1111111111111111112 |
| 3 | a3 | 1111111111111111113 | 1111111111111111113 |
| 4 | a4 | 1111111111111111114 | 1111111111111111114 |
| 5 | a5 | 1111111111111111115 | NULL |
+-----+-------+----------------------+-----------------------+--+
explain select id,name, user_id_bigint,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_decimal;
+--------------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+--------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-3 depends on stages: Stage-4 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| ui |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| ui |
| TableScan |
| alias: ui |
| Statistics: Num rows: 11 Data size: 1253 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 CAST( user_id_bigint AS decimal(20,0)) (type: decimal(20,0)) |
| 1 user_info_id_decimal (type: decimal(20,0)) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: u |
| Statistics: Num rows: 12 Data size: 1365 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 CAST( user_id_bigint AS decimal(20,0)) (type: decimal(20,0)) |
| 1 user_info_id_decimal (type: decimal(20,0)) |
| outputColumnNames: _col0, _col1, _col2, _col9 |
| Statistics: Num rows: 13 Data size: 1501 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: int), _col1 (type: string), _col2 (type: bigint), _col9 (type: decimal(20,0)) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 13 Data size: 1501 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 13 Data size: 1501 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+--------------------------------------------------------------------------------------------------------------------------+--+
2) decimal = bigint: correct
select id,name, user_id_decimal,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_decimal = ui.user_info_id_bigint;
+-----+-------+----------------------+----------------------+--+
| id | name | user_id_decimal | user_info_id_bigint |
+-----+-------+----------------------+----------------------+--+
| 1 | a1 | 1 | 1 |
| 2 | a2 | 1111111111111111112 | 1111111111111111112 |
| 3 | a3 | 1111111111111111113 | 1111111111111111113 |
| 4 | a4 | 1111111111111111114 | 1111111111111111114 |
| 5 | a5 | 1111111111111111115 | NULL |
+-----+-------+----------------------+----------------------+--+
explain select id,name, user_id_decimal,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_decimal = ui.user_info_id_bigint;
+--------------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+--------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-3 depends on stages: Stage-4 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| ui |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| ui |
| TableScan |
| alias: ui |
| Statistics: Num rows: 156 Data size: 1253 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 user_id_decimal (type: decimal(20,0)) |
| 1 CAST( user_info_id_bigint AS decimal(20,0)) (type: decimal(20,0)) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: u |
| Statistics: Num rows: 6 Data size: 1365 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 user_id_decimal (type: decimal(20,0)) |
| 1 CAST( user_info_id_bigint AS decimal(20,0)) (type: decimal(20,0)) |
| outputColumnNames: _col0, _col1, _col3, _col8 |
| Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: int), _col1 (type: string), _col3 (type: decimal(20,0)), _col8 (type: bigint) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+--------------------------------------------------------------------------------------------------------------------------+--+
56 rows selected (0.262 seconds)
spark.sql("select id,name, user_id_decimal,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_decimal = ui.user_info_id_bigint").explain
== Physical Plan ==
*BroadcastHashJoin [user_id_decimal#28], [cast(user_info_id_bigint#89L as decimal(20,0))], LeftOuter, BuildRight
:- *FileScan parquet new.test_type_user[id#25,name#26,user_id_decimal#28] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,user_id_decimal:decimal(20,0)>
+- *BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, bigint, true] as decimal(20,0))))
+- *FileScan parquet new.test_type_user_info[user_info_id_bigint#89L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user_info], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_info_id_bigint:bigint>
spark.sql("select id,name, user_id_decimal,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_decimal = ui.user_info_id_bigint").show()
+---+----+-------------------+-------------------+
| id|name| user_id_decimal|user_info_id_bigint|
+---+----+-------------------+-------------------+
| 1| a1| 1| 1|
| 2| a2|1111111111111111112|1111111111111111112|
| 3| a3|1111111111111111113|1111111111111111113|
| 4| a4|1111111111111111114|1111111111111111114|
| 5| a5|1111111111111111115| null|
+---+----+-------------------+-------------------+
3.2 When one side of the on condition is string and the other is bigint
spark-sql: casts the string side to bigint, result correct
spark-shell: casts the string side to bigint, result correct
hive: converts both string and bigint to double, result wrong
select id,name, user_id_string,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_bigint
spark.sql("select id,name, user_id_string,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_bigint").show()
spark.sql("select id,name, user_id_string,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_bigint").explain
== Physical Plan ==
*BroadcastHashJoin [cast(user_id_string#29 as bigint)], [user_info_id_bigint#89L], LeftOuter, BuildRight
:- *FileScan parquet new.test_type_user[id#25,name#26,user_id_string#29] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,user_id_string:string>
+- *BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
+- *FileScan parquet new.test_type_user_info[user_info_id_bigint#89L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user_info], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_info_id_bigint:bigint>
First, note how long these ids are (19 digits, which matters for the conversion to double):
SELECT LENGTH(1111111111111111113),length(111111111111111111);
+------+------+--+
| _c0 | _c1 |
+------+------+--+
| 19 | 18 |
+------+------+--+
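Those ids are 19 digits long, but a double only carries about 15-16 significant decimal digits, so once Hive implicitly casts both sides to double the distinct ids collapse into the same value. A quick way to see the collision (a sketch that runs in either engine):
```
-- After the cast to double, two different 19-digit ids compare as equal,
-- which is why every long id on one side matches every long id on the other.
select cast('1111111111111111112' as double) = cast(1111111111111111113 as double);
-- returns true: both literals round to the same double value
```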
## spark result: correct
+---+----+-------------------+-------------------+
| id|name| user_id_string|user_info_id_bigint|
+---+----+-------------------+-------------------+
| 1| a1| 1| 1|
| 2| a2|1111111111111111112|1111111111111111112|
| 3| a3|1111111111111111113|1111111111111111113|
| 4| a4|1111111111111111114|1111111111111111114|
| 5| a5|1111111111111111115| null|
+---+----+-------------------+-------------------+
# hive result: incorrect (every long id matches every other long id)
select id,name, user_id_string,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_bigint;
+-----+-------+----------------------+----------------------+--+
| id | name | user_id_string | user_info_id_bigint |
+-----+-------+----------------------+----------------------+--+
| 1 | a1 | 1 | 1 |
| 2 | a2 | 1111111111111111112 | 1111111111111111112 |
| 2 | a2 | 1111111111111111112 | 1111111111111111113 |
| 2 | a2 | 1111111111111111112 | 1111111111111111114 |
| 3 | a3 | 1111111111111111113 | 1111111111111111112 |
| 3 | a3 | 1111111111111111113 | 1111111111111111113 |
| 3 | a3 | 1111111111111111113 | 1111111111111111114 |
| 4 | a4 | 1111111111111111114 | 1111111111111111112 |
| 4 | a4 | 1111111111111111114 | 1111111111111111113 |
| 4 | a4 | 1111111111111111114 | 1111111111111111114 |
| 5 | a5 | 1111111111111111115 | 1111111111111111112 |
| 5 | a5 | 1111111111111111115 | 1111111111111111113 |
| 5 | a5 | 1111111111111111115 | 1111111111111111114 |
+-----+-------+----------------------+----------------------+--+
explain select id,name, user_id_string,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_bigint;
+-------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+-------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-3 depends on stages: Stage-4 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| ui |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| ui |
| TableScan |
| alias: ui |
| Statistics: Num rows: 156 Data size: 1253 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 UDFToDouble(user_id_string) (type: double) |
| 1 UDFToDouble(user_info_id_bigint) (type: double) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: u |
| Statistics: Num rows: 6 Data size: 1365 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 UDFToDouble(user_id_string) (type: double) |
| 1 UDFToDouble(user_info_id_bigint) (type: double) |
| outputColumnNames: _col0, _col1, _col4, _col8 |
| Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: int), _col1 (type: string), _col4 (type: string), _col8 (type: bigint) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+-------------------------------------------------------------------------------------------------------------------+--+
3.3 When one side of the on condition is string and the other is decimal
spark-shell: converts both string and decimal to double, result wrong
spark-sql: converts both string and decimal to double, result wrong
hive: converts both string and decimal to decimal(38,18), result correct
spark.sql("select id,name, user_id_string,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_decimal").explain
== Physical Plan ==
*BroadcastHashJoin [cast(user_id_string#29 as double)], [cast(user_info_id_decimal#90 as double)], LeftOuter, BuildRight
:- *FileScan parquet new.test_type_user[id#25,name#26,user_id_string#29] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,user_id_string:string>
+- *BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, decimal(20,0), true] as double)))
+- *FileScan parquet new.test_type_user_info[user_info_id_decimal#90] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user_info], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_info_id_decimal:decimal(20,0)>
spark.sql("select id,name, user_id_string,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_decimal").show()
hive:
explain select id,name, user_id_string,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_decimal;
+--------------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+--------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-3 depends on stages: Stage-4 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| ui |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| ui |
| TableScan |
| alias: ui |
| Statistics: Num rows: 11 Data size: 1253 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 CAST( user_id_string AS decimal(38,18)) (type: decimal(38,18)) |
| 1 user_info_id_decimal (type: decimal(38,18)) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: u |
| Statistics: Num rows: 6 Data size: 1365 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 CAST( user_id_string AS decimal(38,18)) (type: decimal(38,18)) |
| 1 user_info_id_decimal (type: decimal(38,18)) |
| outputColumnNames: _col0, _col1, _col4, _col9 |
| Statistics: Num rows: 12 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: int), _col1 (type: string), _col4 (type: string), _col9 (type: decimal(20,0)) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 12 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 12 Data size: 1378 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+--------------------------------------------------------------------------------------------------------------------------+--+
## spark-shell result (wrong: the long ids cross-match)
+---+----+-------------------+--------------------+
| id|name| user_id_string|user_info_id_decimal|
+---+----+-------------------+--------------------+
| 1| a1| 1| 1|
| 2| a2|1111111111111111112| 1111111111111111114|
| 2| a2|1111111111111111112| 1111111111111111113|
| 2| a2|1111111111111111112| 1111111111111111112|
| 3| a3|1111111111111111113| 1111111111111111114|
| 3| a3|1111111111111111113| 1111111111111111113|
| 3| a3|1111111111111111113| 1111111111111111112|
| 4| a4|1111111111111111114| 1111111111111111114|
| 4| a4|1111111111111111114| 1111111111111111113|
| 4| a4|1111111111111111114| 1111111111111111112|
| 5| a5|1111111111111111115| 1111111111111111114|
| 5| a5|1111111111111111115| 1111111111111111113|
| 5| a5|1111111111111111115| 1111111111111111112|
+---+----+-------------------+--------------------+
## spark-sql result (the same wrong cross-matching)
1 a1 1 1
2 a2 1111111111111111112 1111111111111111114
2 a2 1111111111111111112 1111111111111111113
2 a2 1111111111111111112 1111111111111111112
3 a3 1111111111111111113 1111111111111111114
3 a3 1111111111111111113 1111111111111111113
3 a3 1111111111111111113 1111111111111111112
4 a4 1111111111111111114 1111111111111111114
4 a4 1111111111111111114 1111111111111111113
4 a4 1111111111111111114 1111111111111111112
5 a5 1111111111111111115 1111111111111111114
5 a5 1111111111111111115 1111111111111111113
5 a5 1111111111111111115 1111111111111111112
Time taken: 0.401 seconds, Fetched 13 row(s)
## hive result (correct)
+-----+-------+----------------------+-----------------------+--+
| id | name | user_id_string | user_info_id_decimal |
+-----+-------+----------------------+-----------------------+--+
| 1 | a1 | 1 | 1 |
| 2 | a2 | 1111111111111111112 | 1111111111111111112 |
| 3 | a3 | 1111111111111111113 | 1111111111111111113 |
| 4 | a4 | 1111111111111111114 | 1111111111111111114 |
| 5 | a5 | 1111111111111111115 | NULL |
+-----+-------+----------------------+-----------------------+--+
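If the column types cannot be changed, the usual workaround on the Spark side is to take the implicit coercion out of the picture and cast the string key explicitly to a type that preserves all the digits, for example decimal(20,0). A minimal sketch (decimal(20,0) is just one reasonable target type here):
```
-- Explicit cast keeps full precision, so Spark returns the same (correct) result as Hive.
select id, name, user_id_string, user_info_id_decimal
from new.test_type_user u
left join new.test_type_user_info ui
  on cast(u.user_id_string as decimal(20,0)) = ui.user_info_id_decimal;
```
The same idea applies to the string/bigint case in Hive: cast one side explicitly to the other side's type (string or bigint) instead of letting the engine fall back to double.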