Quick start using PySpark
Now that we have configured and started PySpark, let's go over some of the common functions we will be using.
Let's take a look at our data file.
Assume we have started the PySpark shell and run the following to create a SparkContext and load the file as an RDD:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()        # reuse the existing SparkContext or create a new one
tf = sc.textFile("j:\\tmp\\data.txt")  # load the text file as an RDD of lines
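If you just want a quick peek at the file before running anything heavier, first() returns only the first element of the RDD; judging from the map() output further down, that should be the Messi line.
>>> tf.first()   # returns the first line of the file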
filter - returns a new RDD containing only the elements that satisfy the given condition. Here it finds the lines that contain the word "test", and count() tells us how many matched.
>>> tf.filter(lambda a : "test" in a).count()
3
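It helps to know that filter() is a transformation and count() is an action: Spark does no work until an action is called. Here is the same query as a minimal two-step sketch.
>>> matches = tf.filter(lambda a : "test" in a)   # transformation: builds a new RDD lazily, nothing runs yet
>>> matches.count()                               # action: triggers the actual computation
3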
collect - really useful; it returns a list of all the elements in an RDD, which is especially handy when you want to see the results.
>>> tf.filter(lambda a : "test" in a).collect()
['test11111', 'testing ', 'reest of the world; test11111']
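Keep in mind that collect() brings the entire RDD back to the driver, which can be painful for a large data set. When you only need a sample, take(n) is a safer option; it returns just the first n elements.
>>> tf.filter(lambda a : "test" in a).take(2)
['test11111', 'testing ']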
map - returns a new RDD by applying a function to every element. Here I apply upper case to my lines and return the results using collect().
>>> tf.map(lambda x : x.upper()).collect()
['I AM THE BEST IN THE WORLD OF SOCCER. SO SAYS MESSI...', 'TEST11111', 'BEST', 'TESTING ', 'GEORGE BEST', 'REEST OF THE WORLD; TEST11111', 'RONALDO']
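A close relative of map is flatMap, which lets one input element produce several output elements. As a quick sketch, splitting each line into words and counting them would look like this:
>>> tf.flatMap(lambda x : x.split()).count()   # counts individual words instead of lines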
reduce - this is a pretty confusing function at first. It takes the first two elements, applies a function to them, and then applies the same function to that result and the next element, and so on until the RDD is exhausted.
For example, say you want to merge all the lines in the file together, with every line converted to upper case (just like above).
>>> tf.map(lambda x : x.upper()).reduce(lambda a,b : a + b)
'I AM THE BEST IN THE WORLD OF SOCCER. SO SAYS MESSI...TEST11111BESTTESTING GEORGE BESTREEST OF THE WORLD; TEST11111RONALDO'
Note that all the lines are now merged into a single string.
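The same pattern works just as well on numbers. As a minimal sketch, mapping each line to its length and then reducing with addition gives the total number of characters in the file:
>>> tf.map(lambda x : len(x)).reduce(lambda a,b : a + b)   # total character count across all lines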