Spark Python API函数学习：pyspark API(2)

如果想及时了解Spark、Hadoop或者Hbase相关的文章，欢迎关注微信公共帐号：iteblog_hadoop

sortBy

# sortBy
x = sc.parallelize(['Cat','Apple','Bat'])
def keyGen(val): return val[0]
y = x.sortBy(keyGen)
print(y.collect())

['Apple', 'Bat', 'Cat']

glom

# glom
x = sc.parallelize(['C','B','A'], 2)
y = x.glom()
print(x.collect()) 
print(y.collect())

['C', 'B', 'A']
[['C'], ['B', 'A']]

cartesian

# cartesian
x = sc.parallelize(['A','B'])
y = sc.parallelize(['C','D'])
z = x.cartesian(y)
print(x.collect())
print(y.collect())
print(z.collect())

['A', 'B']
['C', 'D']
[('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]

groupBy

# groupBy
x = sc.parallelize([1,2,3])
y = x.groupBy(lambda x: 'A' if (x%2 == 1) else 'B' )
print(x.collect())
# y is nested, this iterates through it
print([(j[0],[i for i in j[1]]) for j in y.collect()]) 

[1, 2, 3]
[('A', [1, 3]), ('B', [2])]

pipe

# pipe
x = sc.parallelize(['A', 'Ba', 'C', 'AD'])
y = x.pipe('grep -i "A"') # calls out to grep, may fail under Windows 
print(x.collect())
print(y.collect())

['A', 'Ba', 'C', 'AD']
['A', 'Ba', 'AD']

foreach

# foreach
from __future__ import print_function
x = sc.parallelize([1,2,3])
def f(el):
    '''side effect: append the current RDD elements to a file'''
    f1=open("./foreachExample.txt", 'a+') 
    print(el,file=f1)

# first clear the file contents
open('./foreachExample.txt', 'w').close()  

y = x.foreach(f) # writes into foreachExample.txt

print(x.collect())
print(y) # foreach returns 'None'
# print the contents of foreachExample.txt
with open("./foreachExample.txt", "r") as foreachExample:
    print (foreachExample.read())
    
[1, 2, 3]
None
3
1
2

foreachPartition

# foreachPartition
from __future__ import print_function
x = sc.parallelize([1,2,3],5)
def f(parition):
    '''side effect: append the current RDD partition contents to a file'''
    f1=open("./foreachPartitionExample.txt", 'a+') 
    print([el for el in parition],file=f1)

# first clear the file contents
open('./foreachPartitionExample.txt', 'w').close()  

y = x.foreachPartition(f) # writes into foreachExample.txt

print(x.glom().collect())
print(y)  # foreach returns 'None'
# print the contents of foreachExample.txt
with open("./foreachPartitionExample.txt", "r") as foreachExample:
    print (foreachExample.read())

[[], [1], [], [2], [3]]
None
[]
[]
[1]
[2]
[3]

collect

# collect
x = sc.parallelize([1,2,3])
y = x.collect()
print(x)  # distributed
print(y)  # not distributed

ParallelCollectionRDD[87] at parallelize at PythonRDD.scala:382
[1, 2, 3]

reduce

# reduce
x = sc.parallelize([1,2,3])
y = x.reduce(lambda obj, accumulated: obj + accumulated)  # computes a cumulative sum
print(x.collect())
print(y)

[1, 2, 3]
6

fold

# fold
x = sc.parallelize([1,2,3])
neutral_zero_value = 0  # 0 for sum, 1 for multiplication
y = x.fold(neutral_zero_value,lambda obj, accumulated: accumulated + obj) # computes cumulative sum
print(x.collect())
print(y)

[1, 2, 3]
6

aggregate

# aggregate
x = sc.parallelize([2,3,4])
neutral_zero_value = (0,1) # sum: x+0 = x, product: 1*x = x
seqOp = (lambda aggregated, el: (aggregated[0] + el, aggregated[1] * el)) 
combOp = (lambda aggregated, el: (aggregated[0] + el[0], aggregated[1] * el[1]))
y = x.aggregate(neutral_zero_value,seqOp,combOp)  # computes (cumulative sum, cumulative product)
print(x.collect())
print(y)

[2, 3, 4]
(9, 24)

max

# max
x = sc.parallelize([1,3,2])
y = x.max()
print(x.collect())
print(y)

[1, 3, 2]
3

min

# min
x = sc.parallelize([1,3,2])
y = x.min()
print(x.collect())
print(y)

[1, 3, 2]
1

sum

# sum
x = sc.parallelize([1,3,2])
y = x.sum()
print(x.collect())
print(y)

[1, 3, 2]
6

count

# count
x = sc.parallelize([1,3,2])
y = x.count()
print(x.collect())
print(y)

[1, 3, 2]
3

本博客文章除特别声明，全部都是原创！
原创文章版权归过往记忆大数据（过往记忆）所有，未经许可不得转载。
本文链接: 【Spark Python API函数学习：pyspark API(2)】（https://www.iteblog.com/archives/1396.html）

图文介绍 Presto + Velox 整合

Velox 介绍：一个开源的统一执行引擎

Presto 里面如何把 array 或 Map 里面的元素由行转成列

Data + AI Summit 2022 PPT 下载

Data + AI Summit 2022 超清视频下载

历时一年 Apache Spark 3.3.0 正式发布，新特性详解

Spark Structured Streaming 2021年最新进展的总结

Portable UDF：Facebook 工程师为了解决不同计算引擎 UDF 统一的项目

下面文章您可能感兴趣

Flink Forward 201904 PPT资料下载

Linux命令行下安装Maven与配置

Font Awesome:图标字体，完全CSS控制

2017年关于深度学习的十大趋势预测

HDFS 副本存放磁盘选择策略

盘点2018年晋升为Apache TLP的大数据相关项目

Docker 入门教程：Union File System 在 Docker 的应用

Rheem：可扩展且易于使用的跨平台大数据分析系统

为了让你更全面的了解Apache HBase，我们做了这本专刊

Apache CarbonData 1.4.0 中文文档翻译完成

脱离JVM？ Hadoop生态圈的挣扎与演化

SSDB：可用于替代Redis的高性能NoSQL数据库

Spark+AI Summit 2019 PPT 下载[共124个]

Delta Lake 提供纯 Scala\Java\Python 操作 API，和 Flink 整合更加容易

Hadoop: The Definitive Guide, 4th Edition[pdf]

Spark 3.0 中七个必须知道的 SQL 性能优化

Docker 入门教程：快速开始

Apache Cassandra Composite Key\Partition key\Clustering key 介绍

给Hadoop集群中添加Snappy解压缩库

Docker 入门教程：Docker 基础技术 Union File System

(1)个小伙伴在吐槽

讲解的很细致
100801152016-10-23 16:34 回复