
Spark Commands

The document outlines various operations in Apache Spark's RDD, including glom, map, mappartitions, and mapValues, highlighting their specific use cases. It contrasts map with flatMap and compares groupByKey with reduceByKey in terms of data shuffling and optimization. Additionally, it mentions creating a DataFrame from RDDs using a DDL schema with examples of CSV file paths.

Uploaded by

bhushan

rdd1 = sc.parallelize(data) - creates an RDD by distributing a local collection across partitions

glom() - groups the elements of each partition into a list, so you can check which elements live in which partition
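A minimal sketch of what parallelize() plus glom() show, written in plain Python so it runs without a Spark cluster; the chunking below stands in for Spark's partitioning:

```python
# Plain-Python sketch of parallelize() + glom() (no Spark required).
# PySpark equivalent: sc.parallelize(data, 3).glom().collect()
# parallelize splits the data across partitions; glom turns each
# partition's elements into a list so you can see the partitioning.
data = [1, 2, 3, 4, 5, 6]
num_partitions = 3

size = len(data) // num_partitions
partitions = [data[i * size:(i + 1) * size] for i in range(num_partitions)]

print(partitions)  # [[1, 2], [3, 4], [5, 6]]  <- what glom().collect() shows
```

In real Spark the exact split depends on the number of partitions requested and the data size; glom() is mainly a debugging aid.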

map() - applies a function to every element of the RDD, producing one output element per input element

mapPartitions() - the function is called once per partition in the RDD (it receives an iterator over that partition's elements), not once per element

mapPartitionsWithIndex() - like mapPartitions, but the function also receives the partition's index, so the output can pair each value with the partition it came from
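A plain-Python sketch of the index-plus-iterator signature; enumerate() below plays the role of Spark handing the function each partition's index:

```python
# Plain-Python sketch of mapPartitionsWithIndex semantics.
# PySpark equivalent: rdd.mapPartitionsWithIndex(tag_with_index)
partitions = [["a", "b"], ["c"]]        # stand-in for a 2-partition RDD

def tag_with_index(index, iterator):
    # receives the partition index and an iterator over its elements
    return ((index, value) for value in iterator)

result = [pair
          for index, part in enumerate(partitions)
          for pair in tag_with_index(index, iter(part))]
print(result)  # [(0, 'a'), (0, 'b'), (1, 'c')]
```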

mapValues() - used on RDDs of key-value pairs; it applies the function to the values only, leaving the keys (and therefore the partitioning) unchanged
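The values-only transformation can be sketched in plain Python; the list of tuples stands in for a pair RDD:

```python
# Plain-Python sketch of mapValues on key-value pairs.
# PySpark equivalent: rdd.mapValues(lambda v: v * 10)
# Only the values are transformed; keys stay untouched, which also lets
# Spark keep the existing partitioner instead of re-shuffling.
pairs = [("a", 1), ("b", 2), ("a", 3)]

result = [(key, value * 10) for key, value in pairs]
print(result)  # [('a', 10), ('b', 20), ('a', 30)]
```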

MAP VS FLATMAP:
map: applies the function to each element, producing exactly one output element per input element.
flatMap: applies the function like map, then flattens the results, so one input element can yield zero, one, or many output elements.
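The difference is easiest to see with a split-into-words example, sketched here in plain Python with the PySpark calls in comments:

```python
# Plain-Python illustration of map vs flatMap semantics.
# PySpark equivalents:
#   rdd.map(lambda line: line.split())      -> one list per element
#   rdd.flatMap(lambda line: line.split())  -> flattened into words
lines = ["hello world", "spark rdd"]

mapped = [line.split() for line in lines]
flat_mapped = [word for line in lines for word in line.split()]

print(mapped)       # [['hello', 'world'], ['spark', 'rdd']]
print(flat_mapped)  # ['hello', 'world', 'spark', 'rdd']
```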

groupByKey vs reduceByKey:
groupByKey: collects all the values per key, e.g. (1, 1, 1); every record is shuffled across the network, so it is not optimized.
reduceByKey: pre-aggregates values within each partition before the shuffle, e.g. (3); much less data is shuffled, so it is more optimized.
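A plain-Python sketch of why reduceByKey moves less data: with a map-side combine, each partition ships at most one record per key instead of every record.

```python
# Plain-Python sketch of groupByKey vs reduceByKey shuffle volume.
# groupByKey ships every (key, 1) record across the network, then sums;
# reduceByKey combines within each partition first (a map-side combine),
# so only one (key, partial_sum) per key leaves each partition.
partitions = [[("a", 1), ("a", 1)], [("a", 1), ("b", 1)]]

# groupByKey-style: every record is shuffled
shuffled_records = sum(len(part) for part in partitions)      # 4 records

# reduceByKey-style: pre-aggregate per partition before shuffling
combined = []
for part in partitions:
    local = {}
    for key, value in part:
        local[key] = local.get(key, 0) + value
    combined.append(list(local.items()))
pre_aggregated_records = sum(len(part) for part in combined)  # 3 records

print(shuffled_records, pre_aggregated_records)  # 4 3
```

Both paths produce the same final sums; the difference is purely how many records cross the network during the shuffle.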

creating a DataFrame with a DDL schema string, and creating a DataFrame from an RDD; example CSV paths:

/FileStore/tables/employees__1_-1.csv
/FileStore/tables/departments__1_-1.csv
