Apache Spark/Spark MLlib Workshop
Time and Location
Apr 18, 2020 at 9:00am - 3:00pm EDT
Cobb Galleria Center
Agenda
Morning Session:
- Apache Spark DataFrames API (~10 transformations, ~5 actions)
- Spark Architecture: Driver + Executors
- Spark UI Intro: Jobs, Stages, Tasks
- Data Sources, data import/export, file formats
- Spark parallelism + Shuffle architecture
- Spark Scala API docs + advanced search of mailing lists
- Cluster Managers: YARN vs. Standalone mode in Spark
- Spark Packages
- Tips for using the Spark source code
- PySpark performance considerations
- Dynamic Allocation
- Common Spark Configuration options
*One coffee break during this session
Lunch
Afternoon Session:
- Shared variables: Accumulators and Broadcast variables
- spark.mllib vs. spark.ml
- Spark ML Pipelines (Transformers, Estimators, and Parameters)
- Classification: Logistic regression, Naive Bayes, Decision trees
- Regression: Linear regression, Decision trees, Random forests
- GraphFrames (subgraphs, shortest path, PageRank, Triangle Count, Label Propagation Algorithm)
- Ensemble algorithms: Random Forests and Gradient-Boosted Trees
- K-Means clustering