
Introduction to Apache Spark

Beginner


Duration: 3 days

Certification: Certificate of participation

Who is this course for?
  • Data engineers, analysts, architects, software engineers, IT operations staff, and technical managers interested in the overall architecture and components of Apache Spark, and in understanding Spark through hands-on exercises, use cases, and interactions with different distributed storage systems (HDFS, noSQL).
Prerequisite knowledge and skills

All exercises will be done in Scala and SQL. Prior knowledge of Scala and SQL syntax could help you follow the exercises more easily, but please note that the main scope of the course is to understand the architecture, ways of working, and usage of Spark in different use cases. Scala programming is not part of the course objectives.

Course overview

The course covers Apache Spark's main concepts: the core (architecture, RDDs/DataFrames/Datasets, transformations and actions, the DAG), the SQL engine, the streaming engine, and the machine learning libraries. It also highlights possible uses of Spark in different scenarios such as ETL, analytics, and machine learning.

What topics the course covers

Day 1

  • Spark Overview
    • A brief history of Spark
    • Where Spark fits in the big data landscape
    • Apache Spark vs. Apache MapReduce: An overall architecture comparison
  • Architecture:
    • Cluster Architecture
      • cluster manager, workers, executors
      • Spark Context
      • Cluster Manager Types
      • Deployment scenarios
    • How Spark schedules and executes jobs and tasks
  • Resilient Distributed Datasets: Fundamentals & hands on exercises
    • Ways to create an RDD
      • Parallelize Collection
      • Read from external data source (local drive, HDFS, noSQL, ...)
      • From existing RDD
    • Introduction to Transformations and Actions
    • Caching
    • RDD Types
    • How transformations lazily build up a Directed Acyclic Graph (DAG)
    • Shuffling
    • Hands on: using Spark for ETL
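To give a flavor of the Day 1 material, here is a minimal sketch of creating an RDD from a parallelized collection, chaining lazy transformations, and triggering the DAG with an action. This is an illustrative example, not taken from the course materials; it assumes a local Spark installation and uses made-up data.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; in class this would run against a cluster.
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // 1. Create an RDD by parallelizing a collection.
    val nums = sc.parallelize(1 to 10)

    // 2. Transformations are lazy: filter and map only extend the DAG,
    //    no computation happens yet.
    val evensSquared = nums.filter(_ % 2 == 0).map(n => n * n)

    // 3. An action (collect) schedules the DAG built so far as tasks.
    val result = evensSquared.collect().sorted
    println(result.mkString(","))  // 4,16,36,64,100

    spark.stop()
  }
}
```

The same pattern (lazy transformations, then a single action) underlies the ETL hands-on exercise: read, transform, and only then materialize or write out.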

Day 2

  • SparkSQL and DataFrames/Datasets: Fundamentals and hands on exercises
    • What DataFrames/Datasets are vs. RDDs
    • The DataFrames/Datasets API
    • Catalyst Optimizer
    • Spark SQL
    • Creating and running DataFrame operations
    • Reading from multiple data sources (hands on exercises)
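As a sketch of the Day 2 topics, the snippet below builds a small DataFrame in-place (in the exercises the data would come from external sources instead), runs the same query through both the DataFrame API and Spark SQL over a temporary view, and lets the Catalyst optimizer plan each. The data and names are illustrative, not from the course.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("df-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A small in-memory DataFrame; hypothetical sample data.
    val people = Seq(("Ana", 34), ("Ion", 28), ("Maria", 41)).toDF("name", "age")

    // DataFrame API: Catalyst optimizes this logical plan before execution.
    val adults = people.filter($"age" >= 30).select("name")

    // Equivalent Spark SQL over a registered temporary view.
    people.createOrReplaceTempView("people")
    val viaSql = spark.sql("SELECT name FROM people WHERE age >= 30")

    // Both plans produce the same rows.
    println(adults.collect().map(_.getString(0)).sorted.mkString(","))  // Ana,Maria
    println(viaSql.collect().map(_.getString(0)).sorted.mkString(","))  // Ana,Maria

    spark.stop()
  }
}
```

Because both forms compile down to the same optimized plan, choosing between the DataFrame API and SQL is largely a matter of style.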

Day 3

  • Spark Streaming
    • When to use Structured Spark Streaming
    • Structured streaming:
      • Building Spark streams from Kafka topics
      • Windowing & Aggregation
      • Registering a Spark DataFrame stream in memory and querying it with Spark SQL
  • Spark’s MLlib and MLlib Pipeline API (Spark.ml) for Machine Learning
    • Spark MLlib and Spark.ml
    • Machine Learning Examples:
      • Collaborative filtering: Alternating Least Squares
      • Classification and regression
  • An end-to-end Spark example
    • We will build an end-to-end case covering data input, data cleaning, data storage, and machine learning. We will work in a cloud environment and use Apache Zeppelin for all the Spark coding/exercises (Scala).
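To illustrate the collaborative-filtering topic from the Day 3 agenda, here is a minimal sketch of Alternating Least Squares with the spark.ml Pipeline API. The tiny (user, item, rating) dataset and all parameter values are made up for illustration; they are not from the course materials.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("als-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical (user, item, rating) triples.
    val ratings = Seq(
      (0, 0, 4.0f), (0, 1, 2.0f),
      (1, 0, 5.0f), (1, 2, 1.0f),
      (2, 1, 3.0f), (2, 2, 4.0f)
    ).toDF("userId", "itemId", "rating")

    // ALS factorizes the sparse rating matrix into low-rank
    // user and item factor matrices.
    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("itemId")
      .setRatingCol("rating")
      .setRank(2)
      .setMaxIter(5)
      .setColdStartStrategy("drop")

    val model = als.fit(ratings)

    // Recommend the top item for every user.
    model.recommendForAllUsers(1).show(false)

    spark.stop()
  }
}
```

The same estimator/transformer pattern (`fit` producing a model that can then transform or recommend) carries over to the classification and regression examples on the agenda.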
Skills gained from the course
  • Understanding the main concepts of Apache Spark.
  • Using Spark in different use cases.

Course requirements:

  • Please have an unrestricted Internet connection (port 22 open) and Google Chrome available on your workstation. We also recommend having an SSH client installed.
