Spark-SQL的7种join使用

Spark-SQL的7种Join

1. 环境信息

如下的代码实在Zeppelin0.7版本下的实现的。
Zeppelin是一个值得推荐的代码编写和文档写作的工具。
提供良好的交互式编程体验,类似于Jupyter Notebook或者
IPython Notebook。

1
2
3
4
5
6
7
8
9
10
11
%pyspark

import sys

print(sys.version)
print(spark.version)
print(sc.version)

print(spark)
print(sc)
print(sqlc)

2. 其中Join操作

  • cross join (笛卡尔积)
  • [inner] join
  • left[outer] join
  • right [outer] join
  • full [outer] join
  • left semi join (in)
  • exists/not exists

3. 查看数据

1
2
3
4
5
6
7
8
9
%sh
cd /Users/renewjoy/DataScience/LearnNotes/fulldata/
ls
cat age.csv
cat.addr.csv
name,age
renewjoy,25
oyea91e,28
joy,30

读取数据转换为DataFrame

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
%pyspark
bpath = '/Users/renewjoy/DataScience/LearnNotes/fulldata/'

age = spark.read.csv(bpath+'age.csv', header=True)
addr = spark.read.csv(bpath+'addr.csv', header=True)

age.show()
addr.show()

age.registerTemplate('age')
addr.registerTemplate('addr')

%sql

select * from age

笛卡尔链接 cross join

1
2
3
4
5
6
%sql

select * from age a
cross join addr b
# 显示写出cross join ,否则检查出cartesian product
# inner join必须写出on条件

inner join

1
2
3
4
5
6
% sql

select *
from age a
join addr b
on a.name = b.name

left join

1
2
3
4
5
% sql
select *
from age a
left join addr b
on a.name = b.name

#坐标数据全部显示,右边数据关联上显示,关联不上显示null

right join

1
2
3
4
5
6
7
%sql
select a.*, b.*
from age a
right join addr b
on a.name=b.name
-- where a.name is not null
-- having a.name is not null

左连接代替右连接

1
2
3
4
5
%sql
select a.*, b.*
from addr b
left join age a
on a.name =b.name

full join

1
2
3
4
select a.*, b.*
from age a
full join addr b
on a.name=b.name

左半连接(left semi join实现in)

1
2
3
4
select a.*
from age a
left semi join addr b
on a.name=b.name

多表时比inner join少,但效率高

not in

1
2
3
4
5
6
%sql
select a.*
from age a
left join addr b
on a.name=b.name
where b.name is null

exist

1
2
3
4
select * 
from age a
where exists
(select 1 from addr b where a.name=b.name)

not exists

1
2
3
4
select *
from age a
where not exists (
select 1 from addr b where a.name=b.name)
Sumer Zhang wechat
欢迎您扫一扫上面的微信公众号,订阅我的博客。