
Source Code for Spark Week Day 3
Added the code from podcast #102
and the Movies.txt file I got from:
https://perso.telecom-paristech.fr/eagan/class/igr204/datasets
andkret committed Jul 17, 2019
1 parent 6d79824 commit 05da0a0
Showing 9 changed files with 1,721 additions and 316 deletions.
59 changes: 59 additions & 0 deletions Code Examples/#102 Spark Week Day 3.txt
@@ -0,0 +1,59 @@
// Read in the text file as an RDD of lines
val input = sc.textFile("/notebook/Movies.txt")

case class MovieLine(Line: String)

val movieline = input.map(line => MovieLine(line))

movieline.toDF().registerTempTable("MovieLine")
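// Note: toDF() on an RDD needs the Spark SQL implicits in scope; spark-shell and notebooks usually import them already.
// registerTempTable (createOrReplaceTempView in newer Spark) makes the DataFrame queryable from the %sql paragraphs below.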


// Let's map the date and the genre
case class DateAndGenre(myDate: String, Genre: String)

val dateandgenre = input.map(line => line.split(";")).map(s => DateAndGenre( s(0),s(3) ))
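// Assumption: in the semicolon-separated Movies.txt, field 0 holds the release year and field 3 the genre/subject.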

dateandgenre.toDF().registerTempTable("DateAndGenre")

// Count how many movies were released per year
case class MovieDate(Line: String, myCount: Int)

val countdate = input.map(line => line.split(";")).map(s => (s(0),1))
countdate.toDF().registerTempTable("countdate")

val reduceddate = countdate.reduceByKey((a,b) => a + b).map(s => MovieDate(s._1,s._2))
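// Classic word-count pattern: every line was mapped to (year, 1) above, and reduceByKey sums the 1s per year key.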

reduceddate.toDF().registerTempTable("MovieDate")

// Flatten every field into its own row of the RDD
val flatmappedinput = input.flatMap(line => line.split(";") )
flatmappedinput.toDF().registerTempTable("flatinput")
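// Unlike map, flatMap emits one output element per split field, so every field value becomes its own row.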

// Read the input directly into a DataFrame
val inputasdf = spark.read.format("csv").option("header", "true").option("delimiter", ";").load("/notebook/Movies.txt")
inputasdf.registerTempTable("inputdf")
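// The header option treats the first line of Movies.txt as column names, and the delimiter option matches the semicolon-separated format.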

/* // Use this to store the DataFrame as Parquet on the local drive
val reduceddf = reduceddate.toDF()
reduceddf.write.parquet("/notebook/movie.parquet")
*/

// Read the Parquet file back into a DataFrame
val parquetFileDF = spark.read.parquet("/notebook/movie.parquet")
parquetFileDF.registerTempTable("ParquetRead")
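// Parquet files carry their schema, so no case class is needed when reading them back.
// Note: this assumes the commented-out write above was executed at least once, so /notebook/movie.parquet exists.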


// Spark SQL queries:

// Visualize the raw RDD
%sql select * from MovieLine

// Visualize the map-reduced RDD with the count of movies per year
%sql select Line, myCount from MovieDate order by myCount desc

// Visualize the mapped RDD and count the number of movies per year in Spark SQL
%sql select myDate, count(myDate) as counted from DateAndGenre group by myDate order by counted desc
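// This should return the same per-year counts as the MovieDate table above, only computed with a SQL group by instead of reduceByKey.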

%sql select * from flatinput

%sql select * from ParquetRead
1,661 changes: 1,661 additions & 0 deletions Code Examples/Movies.txt

Large diffs are not rendered by default.

Binary file modified Data Engineering Cookbook.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion Data Engineering Cookbook.tex
@@ -2638,7 +2638,7 @@ \section{Data Science @Zalando}



-\part{1001 Interview Questions}
+\part{1001 Data Engineering Interview Questions}

Looking for a job or just want to know what people find important?
In this chapter you can find a lot of interview questions we collect on the stream.
1 change: 0 additions & 1 deletion PDFPreviewMaker/Previewmaker.aux

This file was deleted.

309 changes: 0 additions & 309 deletions PDFPreviewMaker/Previewmaker.log

This file was deleted.

Binary file removed PDFPreviewMaker/Previewmaker.pdf
Binary file not shown.
Binary file removed PDFPreviewMaker/Previewmaker.synctex.gz
Binary file not shown.
5 changes: 0 additions & 5 deletions PDFPreviewMaker/Previewmaker.tex

This file was deleted.
