DSCI-272: Predicting with Cloudera Machine Learning

Duration: 4 Days (32 Hours)

DSCI-272: Predicting with Cloudera Machine Learning Course Overview:

To collaborate effectively, enterprise data science teams need unified access to business data and to the tools and computing resources required to develop and deploy machine learning workflows. Cloudera Machine Learning (CML), part of the Cloudera Data Platform (CDP), addresses this need.

This four-day course covers machine learning workflows and their operation in CML. Participants explore, visualize, and analyze data, then train, evaluate, and deploy machine learning models.

The curriculum follows an end-to-end data science and machine learning workflow, built on practical scenarios and datasets from a fictional technology company. Demonstrations and hands-on exercises use Python (with PySpark) in CML. By the end of the course, attendees can work through complex data science and machine learning workflows with confidence.

Intended Audience:

The course is designed for data scientists who need to understand how to utilize Cloudera Machine Learning and the Cloudera Data Platform to achieve faster model development and deliver production machine learning at scale.
Data engineers, developers, and solution architects who collaborate with data scientists will also find this course valuable.

Learning Objectives of DSCI-272: Predicting with Cloudera Machine Learning:

Through lecture and hands-on exercises, you will learn how to:

Utilize Cloudera SDX and other components of the Cloudera Data Platform to locate data for machine learning experiments
Use an Applied ML Prototype (AMP)
Manage machine learning experiments
Connect to various data sources and explore data
Utilize Apache Spark and Spark ML
Deploy an ML model as a REST API
Manage and monitor deployed ML models

Introduction to CML

Overview
CML Versus CDSW
ML Workspaces
Workspace Roles
Projects and Teams
Settings
Runtimes/Legacy Engines

Introduction to AMPs and the Workbench

Editors and IDE
Git
Embedded Web Applications
AMPs

Data Access and Lineage

SDX Overview
Data Catalog
Authorization
Lineage

Data Visualization in CML

Data Visualization Overview
CDP Data Visualization Concepts
Using Data Visualization in CML

Experiments

Experiments in CML

Introduction to the CML Native Workbench

Entering Code
Getting Help
Accessing the Linux Command Line
Working With Python Packages
Formatting Session Output

Spark Overview

How Spark Works
The Spark Stack
File Formats in Spark
Spark Interface Languages
Introduction to PySpark
How DataFrame Operations Become Spark Jobs
How Spark Executes a Job

Running a Spark Application

Running a Spark Application
Reading Data into a Spark SQL DataFrame
Examining the Schema of a DataFrame
Computing the Number of Rows and Columns of a DataFrame
Examining a Few Rows of a DataFrame
Stopping a Spark Application

Inspecting a Spark DataFrame

Inspecting a DataFrame
Inspecting a DataFrame Column

Transforming DataFrames

Spark SQL DataFrames
Working with Columns
Working with Rows
Working with Missing Values

Transforming DataFrame Columns

Spark SQL Data Types
Working with Numerical Columns
Working with String Columns
Working with Date and Timestamp Columns
Working with Boolean Columns

Complex Types

Complex Collection Data Types
Arrays
Maps
Structs

User-Defined Functions

User-Defined Functions
Example 1: Hour of Day
Example 2: Great-Circle Distance
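
As an illustration of the second example, a great-circle distance function of the kind that could be wrapped as a Spark UDF might look like the sketch below. The haversine formula, the function name, and the mean Earth radius of 6371 km are assumptions made for illustration; they are not taken from the course materials.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_distance(lat1, lon1, lat2, lon2):
    """Return the haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine term: squared half-chord length between the two points
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

In PySpark, a function like this would typically be registered with `pyspark.sql.functions.udf` and a `DoubleType` return type before being applied to DataFrame columns.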

Reading and Writing DataFrames

Working with Delimited Text Files
Working with Text Files
Working with Parquet Files
Working with Hive Tables
Working with Object Stores
Working with Pandas DataFrames

Combining and Splitting DataFrames

Combining and Splitting DataFrames
Joining DataFrames
Splitting a DataFrame

Summarizing and Grouping DataFrames

Summarizing Data with Aggregate Functions
Grouping Data
Pivoting Data

Window Functions

Window Functions
Example: Cumulative Count and Sum
Example: Compute Average Days Between Rides for Each Rider
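
To make the second example concrete, the sketch below computes the average number of days between rides for each rider in plain Python. The input layout (pairs of rider ID and ride date), the function name, and the sample data are hypothetical. In PySpark the same logic would typically use a window partitioned by rider and ordered by date, together with the `lag` function.

```python
from collections import defaultdict
from datetime import date


def average_days_between_rides(rides):
    """rides: iterable of (rider_id, ride_date). Returns {rider_id: mean gap in days}."""
    dates_by_rider = defaultdict(list)
    for rider_id, ride_date in rides:
        dates_by_rider[rider_id].append(ride_date)

    averages = {}
    for rider_id, dates in dates_by_rider.items():
        dates.sort()  # order rides within each rider's partition
        # Gap between each ride and the previous one (the role lag() plays in a window)
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        if gaps:  # riders with a single ride have no gap to average
            averages[rider_id] = sum(gaps) / len(gaps)
    return averages


# Hypothetical sample data
rides = [
    ("a", date(2024, 1, 1)),
    ("a", date(2024, 1, 4)),
    ("a", date(2024, 1, 11)),
    ("b", date(2024, 2, 1)),
]
result = average_days_between_rides(rides)
```

Rider "b" has only one ride, so this sketch leaves that rider out of the result rather than reporting a gap of zero.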

Machine Learning Overview

Introduction to Machine Learning
Machine Learning Tools

Apache Spark MLlib

Introduction to Apache Spark MLlib

Exploring and Visualizing DataFrames

Possible Workflows for Big Data
Exploring a Single Variable
Exploring a Pair of Variables

Monitoring, Tuning, and Configuring Spark Applications

Monitoring Spark Applications
Configuring the Spark Environment

Fitting and Evaluating Regression Models

Assemble the Feature Vector
Fit the Linear Regression Model

Fitting and Evaluating Classification Models

Generate Label
Fit the Logistic Regression Model

Tuning Algorithm Hyperparameters Using Grid Search

Requirements for Hyperparameter Tuning
Tune the Hyperparameters Using Holdout Cross-Validation
Tune the Hyperparameters Using K-Fold Cross-Validation

Fitting and Evaluating Clustering Models

Print and Plot the Home Coordinates
Fit a Gaussian Mixture Model
Explore the Cluster Profiles

Processing Text: Fitting and Evaluating Topic Models

Fit a Topic Model Using Latent Dirichlet Allocation

Fitting and Evaluating Recommender Models

Recommender Models
Generate Recommendations

Working with Machine Learning Pipelines

Fit the Pipeline Model
Inspect the Pipeline Model

Applying a Scikit-Learn Model to a Spark DataFrame

Build a Scikit-Learn Model
Apply the Model Using a Spark UDF

Deploying a Machine Learning Model as a REST API in CML

Load the Serialized Model
Define a Wrapper Function to Generate a Prediction
Test the Function
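
The three steps above can be sketched as follows, assuming the deployed function takes a dictionary of inputs and returns a JSON-serializable dictionary. The stand-in model (a dictionary of linear coefficients), the `features` key, and the function name are hypothetical; a real project would load a trained model from a file.

```python
import pickle

# Hypothetical stand-in for a trained model: coefficients of a linear model.
# Serialized and reloaded in memory here; a real project would read a model file.
serialized = pickle.dumps({"intercept": 1.0, "weights": [2.0, 0.5]})
model = pickle.loads(serialized)


def predict(args):
    """Wrapper: take a dict of inputs and return a dict holding the prediction."""
    features = [float(x) for x in args["features"]]
    score = model["intercept"] + sum(w * x for w, x in zip(model["weights"], features))
    return {"prediction": score}


# Test the function locally before deploying it
response = predict({"features": [3, 4]})
```

Loading the model once at module level, outside the wrapper, avoids deserializing it on every request.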

Autoscaling, Performance, and GPU Settings

Autoscaling Workloads
Working with GPUs

Model Metrics and Monitoring

Why Monitor Models?
Common Model Metrics
Model Monitoring with Evidently
Continuous Model Monitoring

DSCI-272: Predicting with Cloudera Machine Learning Course Prerequisites

Familiarity with basic concepts of machine learning and predictive modeling.
A working understanding of the Cloudera Machine Learning (CML) platform.
Proficiency in using Python for data manipulation and analysis.
Basic knowledge of data preprocessing techniques and feature engineering.
Experience with data visualization and interpretation of machine learning models.

Discover the perfect fit for your learning journey

Choose Learning Modality

Live Online

Convenience
Cost-effective
Self-paced learning
Scalability

Classroom

Interaction and collaboration
Networking opportunities
Real-time feedback
Personal attention

Onsite

Familiar environment
Confidentiality
Team building
Immediate application

Training Exclusives

This course comes with the following benefits:

Practice labs
Training by certified trainers
Access to the recordings of your class sessions for 90 days
Digital courseware
24/7 learner support
