Apache Spark/Spark MLlib Workshop

Time and Location

Apr 18, 2020 at 9:00am - 3:00pm EDT

Cobb Galleria Center

Agenda

Morning Session:

 

- Apache Spark DataFrames API (~10 transformations, ~5 actions)

- Spark Architecture: Driver + Executors

- Spark UI Intro: Jobs, Stages, Tasks

- Data Sources, data import/export, file formats

- Spark parallelism + Shuffle architecture

- Spark Scala API docs + advanced search of mailing lists

- Cluster Managers: YARN vs Standalone mode in Spark

- Spark Packages

- Tips for using the Spark source code

- PySpark performance considerations

- Dynamic Allocation

- Common Spark Configuration options

 

*One coffee break during this session

 

Lunch

 

Afternoon Session:

 

- Shared variables: Accumulators and Broadcast variables

- spark.mllib vs. spark.ml

- Spark ML Pipelines (Transformers, Estimators, and Parameters)

- Classification: Logistic regression, Naive Bayes, Decision trees

- Regression: Linear regression, Decision tree, Random forest

- GraphFrames (subgraphs, shortest path, PageRank, Triangle Count, Label Propagation Algorithm)

- Ensemble algorithms: Random Forests and Gradient-Boosted Trees

- K-Means clustering


© All rights reserved 2018, Southern Data Science, LLC
