Advanced Spark Application Performance Tuning

Duration: 3 Days (24 Hours)

Advanced Spark Application Performance Tuning Course Overview:

This three-day, hands-on course teaches developers the key concepts and techniques for improving the performance of their Apache Spark applications. Participants learn to identify common sources of poor performance in Spark applications, apply techniques for avoiding or resolving them, and follow best practices for monitoring Spark applications.

The course begins with a review of Apache Spark’s architecture and core concepts, extending to the underlying data platform. It then covers how to tune Spark application code for better performance. Instructor-led demonstrations illustrate each performance problem and its solution, followed by hands-on exercises in an interactive notebook environment where participants practice the techniques covered.

The course primarily targets Spark 2.4 and also introduces the Adaptive Query Execution framework of Spark 3.0.

Intended Audience:

  • Software developers
  • Engineers
  • Data scientists

Learning Objectives of Advanced Spark Application Performance Tuning:

Students who successfully complete this course will be able to:

  • Understand Apache Spark’s architecture, job execution, and how techniques such as lazy execution and pipelining can improve runtime performance
  • Evaluate the performance characteristics of core data structures such as RDD and DataFrames
  • Select the file formats that will provide the best performance for your application
  • Identify and resolve performance problems caused by data skew
  • Use partitioning, bucketing, and join optimizations to improve Spark SQL performance
  • Understand the performance overhead of Python-based RDDs, DataFrames, and user-defined functions
  • Take advantage of caching for better application performance
  • Understand how the Catalyst and Tungsten optimizers work
  • Understand how Workload XM can help troubleshoot and proactively monitor Spark application performance
  • Learn about the new features in Spark 3.0 and specifically how the Adaptive Query Execution engine improves performance
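
As an illustration of the first objective, the following is a minimal pure-Python analogy for lazy execution and pipelining. It is not Spark code and is not taken from the course material: the function names `parse` and `keep_even` and the `calls` log are hypothetical. Generators stand in for transformations, and `list(...)` stands in for an action.

```python
# Pure-Python analogy for lazy execution and pipelining; this is not Spark API code.
calls = []  # records the order in which records pass through each step


def parse(lines):
    # "Transformation": a generator does no work until something consumes it.
    for line in lines:
        calls.append(("parse", line))
        yield int(line)


def keep_even(numbers):
    # Second "transformation", chained onto the first without an intermediate list.
    for n in numbers:
        calls.append(("filter", n))
        if n % 2 == 0:
            yield n


raw = ["1", "2", "3", "4"]
pipeline = keep_even(parse(raw))       # only builds the chain; nothing runs yet
work_done_before_action = len(calls)   # still 0 at this point

result = list(pipeline)                # the "action" pulls records through the chain
```

Because each record flows through both steps before the next record is read, the log interleaves `parse` and `filter` entries instead of finishing all parsing first. Spark applies a similar idea when it pipelines narrow transformations within a stage, although its actual execution model is considerably more involved.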

Course Outline:

Spark Architecture
  • RDDs
  • DataFrames and Datasets
  • Lazy Evaluation
  • Pipelining
  • Available Formats Overview
  • Impact on Performance
  • The Small Files Problem
  • The Cost of Inference
  • Mitigating Tactics
  • Recognizing Skew
  • Mitigating Tactics
  • Catalyst Overview
  • Tungsten Overview
  • Denormalization
  • Broadcast Joins
  • Map-Side Operations
  • Sort Merge Joins
  • Partitioned Tables
  • Bucketed Tables
  • Impact on Performance
  • Skewed Joins
  • Bucketed Joins
  • Incremental Joins
  • PySpark Overhead
  • Scalar UDFs
  • Vector UDFs using Apache Arrow
  • Scala UDFs
  • Caching Options
  • Impact on Performance
  • Caching Pitfalls
  • WXM Overview
  • WXM for Spark Developers
  • Adaptive Number of Shuffle Partitions
  • Skew Joins
  • Convert Sort Merge Joins to Broadcast Joins
  • Dynamic Partition Pruning
  • Dynamic Coalesce Shuffle Partitions
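
The Spark 3.0 topics that close the outline are controlled by configuration properties. The sketch below collects a few of them in a plain Python dictionary and renders them as `spark-submit` arguments. The property names are taken from the Spark 3.0 configuration reference as best understood here and should be checked against the Spark version in use; the names `AQE_SETTINGS` and `to_submit_args` are hypothetical helpers for illustration.

```python
# Property names assumed from the Spark 3.0 configuration reference;
# verify them (and their defaults) against your Spark version.
AQE_SETTINGS = {
    # Master switch for Adaptive Query Execution (disabled by default in Spark 3.0).
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions after a stage completes.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split heavily skewed partitions in sort merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Prune partitions at runtime based on the other side of a join.
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
}


def to_submit_args(settings):
    """Render settings as a flat list of spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


submit_args = to_submit_args(AQE_SETTINGS)
```

The same key and value pairs could instead be applied one at a time with `SparkSession.builder.config(key, value)` when the session is created. Several of these properties already default to `true` once AQE is enabled, so listing them explicitly mainly documents intent.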

Advanced Spark Application Performance Tuning Course Prerequisites

  • Spark examples and hands-on exercises are presented in Python, so the ability to program in Python is required. Basic familiarity with the Linux command line is assumed. Basic knowledge of SQL is helpful.

Discover the perfect fit for your learning journey

Choose Learning Modality

Live Online

  • Convenience
  • Cost-effective
  • Self-paced learning
  • Scalability

Classroom

  • Interaction and collaboration
  • Networking opportunities
  • Real-time feedback
  • Personal attention

Onsite

  • Familiar environment
  • Confidentiality
  • Team building
  • Immediate application

Training Exclusives

This course comes with the following benefits:

  • Practice labs
  • Training by certified trainers
  • Access to the recordings of your class sessions for 90 days
  • Digital courseware
  • 24/7 learner support
