Quick start using PySpark
Now that we have configured and started PySpark, let's go over some of the common functions we will be using.
Let's take a look at our data file.
Assume we have started the PySpark shell and run the following to create a SparkContext and load the file as an RDD:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()        # reuse the existing SparkContext or create a new one
tf = sc.textFile("j:\\tmp\\data.txt")  # load the text file as an RDD of lines
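If you just want a quick peek at the file before running anything heavier, first() returns only the first element of the RDD; judging from the map() output further down, that should be the Messi line.
>>> tf.first()   # returns the first line of the file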
filter - returns a new RDD containing only the elements that satisfy the given condition. Here it finds the lines that contain the word "test", and count() tells us how many matched.
>>> tf.filter(lambda a : "test" in a).count()
3
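It helps to know that filter() is a transformation and count() is an action: Spark does no work until an action is called. Here is the same query as a minimal two-step sketch.
>>> matches = tf.filter(lambda a : "test" in a)   # transformation: builds a new RDD lazily, nothing runs yet
>>> matches.count()                               # action: triggers the actual computation
3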
collect - really useful; it returns a list of all the elements in an RDD, which is especially handy when you want to see the results.
>>> tf.filter(lambda a : "test" in a).collect()
['test11111', 'testing ', 'reest of the world; test11111']
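Keep in mind that collect() brings the entire RDD back to the driver, which can be painful for a large data set. When you only need a sample, take(n) is a safer option; it returns just the first n elements.
>>> tf.filter(lambda a : "test" in a).take(2)
['test11111', 'testing ']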
map - returns a new RDD by applying a function to every element. Here I apply upper case to my lines and return the results using collect().
>>> tf.map(lambda x : x.upper()).collect()
['I AM THE BEST IN THE WORLD OF SOCCER. SO SAYS MESSI...', 'TEST11111', 'BEST', 'TESTING ', 'GEORGE BEST', 'REEST OF THE WORLD; TEST11111', 'RONALDO']
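A close relative of map is flatMap, which lets one input element produce several output elements. As a quick sketch, splitting each line into words and counting them would look like this:
>>> tf.flatMap(lambda x : x.split()).count()   # counts individual words instead of lines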
reduce - this is a pretty confusing function at first. It takes the first two elements, applies a function to them, and then applies the same function to that result and the next element, and so on until the RDD is exhausted.
For example, say you want to merge all the lines in the file together, with every line converted to upper case (just like above).
>>> tf.map(lambda x : x.upper()).reduce(lambda a,b : a + b)
'I AM THE BEST IN THE WORLD OF SOCCER. SO SAYS MESSI...TEST11111BESTTESTING GEORGE BESTREEST OF THE WORLD; TEST11111RONALDO'
Note that all the lines are now merged into a single string.
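The same pattern works just as well on numbers. As a minimal sketch, mapping each line to its length and then reducing with addition gives the total number of characters in the file:
>>> tf.map(lambda x : len(x)).reduce(lambda a,b : a + b)   # total character count across all lines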