
Spark Commands

The document outlines various operations in Apache Spark's RDD, including glom, map, mappartitions, and mapValues, highlighting their specific use cases. It contrasts map with flatMap and compares groupByKey with reduceByKey in terms of data shuffling and optimization. Additionally, it mentions creating a DataFrame from RDDs using a DDL schema with examples of CSV file paths.

Uploaded by

bhushan

rdd1 = sc.parallelize(data) - creates an RDD by distributing a local collection across partitions

glom() - groups the elements of each partition into a list, so you can check which elements live in which partition
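A minimal sketch of what parallelize() plus glom() show, written in plain Python so it runs without a Spark cluster; the chunking below stands in for Spark's partitioning:

```python
# Plain-Python sketch of parallelize() + glom() (no Spark required).
# PySpark equivalent: sc.parallelize(data, 3).glom().collect()
# parallelize splits the data across partitions; glom turns each
# partition's elements into a list so you can see the partitioning.
data = [1, 2, 3, 4, 5, 6]
num_partitions = 3

size = len(data) // num_partitions
partitions = [data[i * size:(i + 1) * size] for i in range(num_partitions)]

print(partitions)  # [[1, 2], [3, 4], [5, 6]]  <- what glom().collect() shows
```

In real Spark the exact split depends on the number of partitions requested and the data size; glom() is mainly a debugging aid.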

map() - applies a function to every element of the RDD, producing one output element per input element

mapPartitions() - the function is called once per partition in the RDD (it receives an iterator over that partition's elements), not once per element

mapPartitionsWithIndex() - like mapPartitions, but the function also receives the partition's index, so the output can pair each value with the partition it came from
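A plain-Python sketch of the index-plus-iterator signature; enumerate() below plays the role of Spark handing the function each partition's index:

```python
# Plain-Python sketch of mapPartitionsWithIndex semantics.
# PySpark equivalent: rdd.mapPartitionsWithIndex(tag_with_index)
partitions = [["a", "b"], ["c"]]        # stand-in for a 2-partition RDD

def tag_with_index(index, iterator):
    # receives the partition index and an iterator over its elements
    return ((index, value) for value in iterator)

result = [pair
          for index, part in enumerate(partitions)
          for pair in tag_with_index(index, iter(part))]
print(result)  # [(0, 'a'), (0, 'b'), (1, 'c')]
```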

mapValues() - used on RDDs of key-value pairs; it applies the function to the values only, leaving the keys (and therefore the partitioning) unchanged
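The values-only transformation can be sketched in plain Python; the list of tuples stands in for a pair RDD:

```python
# Plain-Python sketch of mapValues on key-value pairs.
# PySpark equivalent: rdd.mapValues(lambda v: v * 10)
# Only the values are transformed; keys stay untouched, which also lets
# Spark keep the existing partitioner instead of re-shuffling.
pairs = [("a", 1), ("b", 2), ("a", 3)]

result = [(key, value * 10) for key, value in pairs]
print(result)  # [('a', 10), ('b', 20), ('a', 30)]
```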

MAP VS FLATMAP:
map: applies the function to each element, producing exactly one output element per input element.
flatMap: applies the function like map, then flattens the results, so one input element can yield zero, one, or many output elements.
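The difference is easiest to see with a split-into-words example, sketched here in plain Python with the PySpark calls in comments:

```python
# Plain-Python illustration of map vs flatMap semantics.
# PySpark equivalents:
#   rdd.map(lambda line: line.split())      -> one list per element
#   rdd.flatMap(lambda line: line.split())  -> flattened into words
lines = ["hello world", "spark rdd"]

mapped = [line.split() for line in lines]
flat_mapped = [word for line in lines for word in line.split()]

print(mapped)       # [['hello', 'world'], ['spark', 'rdd']]
print(flat_mapped)  # ['hello', 'world', 'spark', 'rdd']
```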

groupByKey vs reduceByKey:
groupByKey: collects all the values per key, e.g. (1, 1, 1); every record is shuffled across the network, so it is not optimized.
reduceByKey: pre-aggregates values within each partition before the shuffle, e.g. (3); much less data is shuffled, so it is more optimized.
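A plain-Python sketch of why reduceByKey moves less data: with a map-side combine, each partition ships at most one record per key instead of every record.

```python
# Plain-Python sketch of groupByKey vs reduceByKey shuffle volume.
# groupByKey ships every (key, 1) record across the network, then sums;
# reduceByKey combines within each partition first (a map-side combine),
# so only one (key, partial_sum) per key leaves each partition.
partitions = [[("a", 1), ("a", 1)], [("a", 1), ("b", 1)]]

# groupByKey-style: every record is shuffled
shuffled_records = sum(len(part) for part in partitions)      # 4 records

# reduceByKey-style: pre-aggregate per partition before shuffling
combined = []
for part in partitions:
    local = {}
    for key, value in part:
        local[key] = local.get(key, 0) + value
    combined.append(list(local.items()))
pre_aggregated_records = sum(len(part) for part in combined)  # 3 records

print(shuffled_records, pre_aggregated_records)  # 4 3
```

Both paths produce the same final sums; the difference is purely how many records cross the network during the shuffle.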

creating a DataFrame with a DDL schema string, and creating a DataFrame from an RDD; example CSV paths:

/FileStore/tables/employees__1_-1.csv
/FileStore/tables/departments__1_-1.csv
