Apache Spark

In this Apache Spark training course, participants will build complete, unified big data applications that combine batch, streaming, and interactive analytics on all of their data. They will learn to use Spark to write sophisticated parallel applications that enable faster, better decisions and real-time actions, applied to a wide variety of use cases, architectures, and industries.

By attending the Apache Spark workshop, participants will learn to:

  • Use the Spark shell for interactive data analysis (see the sketch after this list)
  • Understand the features of Spark's Resilient Distributed Datasets (RDDs)
  • Understand the fundamentals of running Spark on a cluster
  • Write parallel programs with Spark
  • Write and run Spark applications
  • Process streaming data with Spark
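
As a taste of the first objective, a minimal interactive session in the Scala spark-shell might look like the sketch below; the HDFS path is a hypothetical placeholder, and sc is the SparkContext the shell creates for you.

    // In spark-shell, `sc` (a SparkContext) is already defined.
    val logs = sc.textFile("hdfs:///user/train/weblogs.txt")  // hypothetical path

    // Transformations such as filter() are lazy; they only define an RDD lineage.
    val errors = logs.filter(line => line.contains("ERROR"))

    // Actions such as count() and take() trigger distributed execution.
    println(errors.count())
    errors.take(5).foreach(println)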

PREREQUISITES

  • Some programming experience (Python and Scala suggested)
  • Basic knowledge of Linux
  • Prior knowledge of Hadoop is not required

COURSE AGENDA

  • Problems with Traditional Large-Scale Systems
  • Introducing Spark
  • What is Apache Spark?
  • Using the Spark Shell
  • Resilient Distributed Datasets (RDDs)
  • Functional Programming with Spark
  • RDD Operations
  • Key-Value Pair RDDs
  • MapReduce and Pair RDD Operations (see the word-count sketch after this agenda)
  • Why HDFS?
  • HDFS Architecture
  • Using HDFS
  • A Spark Standalone Cluster
  • The Spark Standalone Web UI
  • RDD Partitions and HDFS Data Locality
  • Working with Partitions
  • Executing Parallel Operations
  • RDD Lineage
  • Caching Overview
  • Distributed Persistence
  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Configuring Spark Properties
  • Building and Running a Spark Application
  • Logging
  • Example: Streaming Word Count (see the streaming sketch after this agenda)
  • Other Streaming Operations
  • Sliding Window Operations
  • Developing Spark Streaming Applications
  • Iterative Algorithms
  • Graph Analysis
  • Machine Learning
  • Shared Variables: Broadcast Variables (see the shared-variables sketch after this agenda)
  • Shared Variables: Accumulators
  • Common Performance Issues
  • Spark and the Hadoop Ecosystem
  • Spark and MapReduce
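
As a concrete taste of several agenda items at once (creating a SparkContext, RDD operations, key-value pair RDDs, and caching), here is a minimal word-count sketch in Scala. The application name and input path are illustrative assumptions, not course materials.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Configure and create the SparkContext ("Creating the SparkContext").
        val conf = new SparkConf().setAppName("WordCount")
        val sc = new SparkContext(conf)

        // "input.txt" is a hypothetical input path.
        val lines = sc.textFile("input.txt")

        // Classic MapReduce-style pair RDD operations.
        val counts = lines
          .flatMap(line => line.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Cache the result if it will be reused ("Caching Overview").
        counts.cache()

        counts.take(10).foreach { case (word, n) => println(s"$word\t$n") }
        sc.stop()
      }
    }

Packaged as a JAR and launched with spark-submit, this is the same computation the interactive shell sketch performs, restructured as a standalone application ("Spark Applications vs. Spark Shell").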
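
For the streaming portion of the agenda, a minimal sketch of a streaming word count with a sliding window might look like the following; the socket host and port are hypothetical (for example, fed locally by nc -lk 9999).

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingWordCount")
        // Process the stream in 1-second micro-batches.
        val ssc = new StreamingContext(conf, Seconds(1))

        // Hypothetical text source: a local socket.
        val lines = ssc.socketTextStream("localhost", 9999)
        val pairs = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))

        // Per-batch word counts ("Example: Streaming Word Count").
        pairs.reduceByKey(_ + _).print()

        // Sliding window: totals over the last 30 seconds, recomputed every 10
        // ("Sliding Window Operations").
        pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }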
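
Finally, for the shared-variables items, here is a minimal sketch of a broadcast variable and an accumulator, assuming the longAccumulator API of Spark 2.x; the stop-word set and sample data are invented for illustration.

    import org.apache.spark.{SparkConf, SparkContext}

    object SharedVariablesDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SharedVariablesDemo"))

        // Broadcast: ship a read-only lookup table to every executor once.
        val stopWords = sc.broadcast(Set("the", "a", "and"))

        // Accumulator: a counter workers add to and the driver reads afterwards.
        // Note: updates made inside transformations may be re-applied on task retries.
        val skipped = sc.longAccumulator("skipped words")

        val words = sc.parallelize(Seq("the", "spark", "and", "cluster"))
        val kept = words.filter { w =>
          val keep = !stopWords.value.contains(w)
          if (!keep) skipped.add(1)
          keep
        }

        println(kept.collect().mkString(", "))  // spark, cluster
        println(s"skipped: ${skipped.value}")   // 2
        sc.stop()
      }
    }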