Spark Mini Project

The document shows PySpark and Scala code for analyzing an H1B visa dataset using Spark. The code reads a Parquet file, selects and renames columns, keeps certified cases, drops rows with missing values, and writes the results to a CSV file.

Uploaded by

Rahul S.Kumar


from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("H1B")
         .config("spark.some.config.option", "some-value")
         .getOrCreate())

# Read the H1B dataset and keep only the columns of interest
parquet1DF = spark.read.parquet("h1_b_dataset.parquet")
parquetDF11 = parquet1DF.select("CASE_STATUS", "VISA_CLASS", "EMPLOYER_NAME",
                                "JOB_TITLE", "PREVAILING_WAGE", "PW_SOURCE_YEAR",
                                "WORKSITE_STATE")

# Rename the wage and year columns
p1 = (parquetDF11.withColumnRenamed("PREVAILING_WAGE", "SALARY")
      .withColumnRenamed("PW_SOURCE_YEAR", "FINANCIAL_YEAR"))

# Keep certified cases only, then drop rows that contain nulls
p2 = p1.where(p1.CASE_STATUS == "CERTIFIED")
p3 = p2.na.drop()

# Cast every column to its intended type
p4 = p3.selectExpr("cast(CASE_STATUS as string) CASE_STATUS",
                   "cast(VISA_CLASS as string) VISA_CLASS",
                   "cast(EMPLOYER_NAME as string) EMPLOYER_NAME",
                   "cast(JOB_TITLE as string) JOB_TITLE",
                   "cast(SALARY as double) SALARY",
                   "cast(FINANCIAL_YEAR as integer) FINANCIAL_YEAR",
                   "cast(WORKSITE_STATE as string) WORKSITE_STATE")

# Exclude employers whose name ends with "LLC"
p5 = p4.filter(~p4.EMPLOYER_NAME.endswith("LLC"))

# Write the result as comma-separated CSV with a header row
# (Spark writes a directory named c.csv containing part files)
p5.write.format('csv').option('header', True).option('sep', ',').save('c.csv')
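
The row-level rules in the PySpark listing above (keep CERTIFIED cases, drop rows with missing values, exclude employers whose name ends with "LLC") can be sanity-checked without a Spark session. What follows is a minimal plain-Python sketch of those rules, not the project's own code; the sample rows and the helper name `keep_row` are hypothetical and exist only for illustration.

```python
# Hypothetical sample rows; the values are invented for illustration only.
rows = [
    {"CASE_STATUS": "CERTIFIED", "EMPLOYER_NAME": "ACME INC", "SALARY": 90000.0},
    {"CASE_STATUS": "DENIED", "EMPLOYER_NAME": "ACME INC", "SALARY": 85000.0},
    {"CASE_STATUS": "CERTIFIED", "EMPLOYER_NAME": "EXAMPLE LLC", "SALARY": 70000.0},
    {"CASE_STATUS": "CERTIFIED", "EMPLOYER_NAME": None, "SALARY": 60000.0},
]


def keep_row(row: dict) -> bool:
    """Apply the three filters: certified, no missing values, name not ending in LLC."""
    if row["CASE_STATUS"] != "CERTIFIED":
        return False
    if any(value is None for value in row.values()):
        return False
    return not row["EMPLOYER_NAME"].endswith("LLC")


kept = [row for row in rows if keep_row(row)]
print(kept)
```

Only the first sample row passes all three checks, which is the behavior the Spark filters are expected to reproduce at scale.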

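// Scala version of the same pipeline.
// Assumes a SparkSession named `spark` already exists, as in spark-shell.
import spark.implicits._

// Parquet files store their own schema, so no header or inferSchema options are needed
val data = spark.read.parquet("h1_b_dataset.parquet")

val raw = data.select("CASE_STATUS", "VISA_CLASS", "EMPLOYER_NAME", "JOB_TITLE",
  "PREVAILING_WAGE", "PW_SOURCE_YEAR", "WORKSITE_STATE")

val info = raw.withColumnRenamed("PREVAILING_WAGE", "SALARY")
  .withColumnRenamed("PW_SOURCE_YEAR", "FINANCIAL_YEAR")

// Keep certified cases only
val value = info.filter(info("CASE_STATUS") === "CERTIFIED")

// Exclude employers whose name ends with "LLC"
val raws = value.filter(!value("EMPLOYER_NAME").endsWith("LLC"))

// Build one predicate that is true when any column ends with "LLC"
val conditions = value.columns.map(value(_).endsWith("LLC")).reduce(_ or _)

// Keep only the rows where that predicate holds, then drop the helper column
val output = value.withColumn("condition", conditions).filter($"condition" === true).drop("condition")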
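
The `conditions` expression in the Scala listing folds one per-column test into a single predicate with a logical OR, and `output` keeps the rows where that predicate is true. A rough plain-Python analogue of that fold is sketched below, assuming small in-memory rows; the sample data and the name `any_column_ends_with` are hypothetical.

```python
from functools import reduce
import operator


def any_column_ends_with(row: dict, suffix: str) -> bool:
    """Return True when at least one column value, read as text, ends with suffix."""
    # One boolean per column, then OR them together, like map(...).reduce(_ or _)
    checks = [str(value).endswith(suffix) for value in row.values()]
    return reduce(operator.or_, checks)


# Hypothetical sample rows; the values are invented for illustration only.
rows = [
    {"EMPLOYER_NAME": "EXAMPLE LLC", "JOB_TITLE": "ANALYST"},
    {"EMPLOYER_NAME": "ACME INC", "JOB_TITLE": "ENGINEER"},
    {"EMPLOYER_NAME": "ACME INC", "JOB_TITLE": "CONSULTANT AT EXAMPLE LLC"},
]

matching = [row for row in rows if any_column_ends_with(row, "LLC")]
print(len(matching))
```

Because the test runs over every column, a row can match through a column other than the employer name, which is worth keeping in mind when comparing `output` with `raws`.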
