
Apache Spark/Spark MLlib Workshop

Time and Location

Apr 18, 2020 at 9:00am - 3:00pm EDT

Cobb Galleria Center


Morning Session:


- Apache Spark DataFrames API (~10 transformations, ~5 actions)

- Spark Architecture: Driver + Executors

- Spark UI Intro: Jobs, Stages, Tasks

- Data Sources, data import/export, file formats

- Spark parallelism + Shuffle architecture

- Spark Scala API docs + advanced search of mailing lists

- Cluster Managers: YARN vs Standalone mode in Spark

- Spark Packages

- Tips for using the Spark source code

- PySpark performance considerations

- Dynamic Allocation

- Common Spark Configuration options


*One coffee break during this session




Afternoon Session:


- Shared variables: Accumulators and Broadcast variables

- spark.mllib vs.

- Spark ML Pipelines (Transformers, Estimators, and Parameters)

- Classification: Logistic regression, Naive Bayes, Decision trees

- Regression: Linear regression, Decision tree, Random forest

- GraphFrames (subgraphs, shortest path, PageRank, Triangle Count, Label Propagation Algorithm)

- Ensemble algorithms: Random Forests and Gradient-Boosted Trees

- K-Means clustering
