Apache Spark / PySpark Course | CodeData

What You'll Learn

1

Installation, RDD fundamentals, the SparkContext, and an overview of cluster architecture.

2

Working with structured data, running SQL queries on DataFrames, and query optimisation via Catalyst.

3

Map, filter, reduce operations, aggregations, and lazy evaluation principles.

4

Joins, window functions, and complex aggregations across large datasets.

5

MLlib for classification and clustering, and real-time data processing with Spark Streaming.

6

Partitioning strategies, caching, cluster deployment, and production optimisations.
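
As a rough illustration of the lazy evaluation principle named in module 3, the sketch below uses plain Python iterators rather than Spark itself, so it is an analogy and not the Spark API. The names `square` and `calls` are hypothetical helpers introduced only for this example. Building the map and filter steps does no work; the elements are processed only when the final reduce consumes them, which loosely mirrors how Spark defers transformations until an action runs.

```python
from functools import reduce

calls = []  # hypothetical log: records each element the map step actually touches


def square(x):
    calls.append(x)
    return x * x


numbers = range(1, 6)

# "Transformations": map and filter return lazy iterators, so no element
# has been squared at this point.
squared = map(square, numbers)
evens = filter(lambda v: v % 2 == 0, squared)
calls_before_action = len(calls)

# "Action": reduce consumes the iterator and pulls every element
# through the map and filter steps.
total = reduce(lambda a, b: a + b, evens, 0)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, total)
```

In Spark the same shape would be expressed with RDD or DataFrame transformations followed by an action, with the work distributed across a cluster instead of a single iterator.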

Apache Spark

PySpark

Hadoop

AWS/GCP/Azure

Databricks

Essential for building scalable, distributed data pipelines that process big data.

Recommended for large-scale machine learning projects and distributed computing.

Optional but valuable for understanding advanced big data analytics.
