  • Length:
    4 Weeks
  • Effort:
    5–10 hours per week
  • Price:
    FREE
    Add a Verified Certificate for $99 USD

  • Institution
  • Subject:
  • Level:
    Intermediate
  • Language:
    English
  • Video Transcript:
    English

Prerequisites

  • Python programming background
  • experience with PySpark equivalent to CS105x: Introduction to Spark
  • comfort with mathematical and algorithmic reasoning
  • familiarity with basic machine learning concepts
  • exposure to algorithms, probability, linear algebra and calculus

About this course

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability and optimization. Learning algorithms enable a wide range of applications, from everyday tasks such as product recommendations and spam filtering to bleeding-edge applications like self-driving cars and personalized medicine. In the age of ‘big data’, with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, machine learning techniques are fast becoming a core component of large-scale data processing pipelines.

This statistics and data analysis course introduces the underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines. We present an integrated view of data processing by highlighting the various components of these pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. You will gain hands-on experience applying these principles using Spark, a cluster computing system well-suited for large-scale machine learning tasks, and its packages spark.ml and spark.mllib. You will implement distributed algorithms for fundamental statistical models (linear regression, logistic regression, principal component analysis) while tackling key problems from domains such as online advertising and cognitive neuroscience.
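To give a rough sense of what "implementing a distributed algorithm" for linear regression can mean, here is a minimal sketch that is not taken from the course materials. It assumes plain Python lists as stand-ins for the partitions of a distributed dataset, and it uses batch gradient descent on squared loss: each partition computes a partial gradient (the "map" step), and the partial gradients are summed (the "reduce" step). All function names and the toy data are hypothetical; in Spark the same pattern would typically be expressed with operations on a distributed collection instead of Python lists.

```python
from functools import reduce


def gradient_summand(weights, point):
    """Per-record gradient term (w . x - y) * x for squared loss."""
    features, label = point
    error = sum(w * x for w, x in zip(weights, features)) - label
    return [error * x for x in features]


def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [ai + bi for ai, bi in zip(a, b)]


def fit_linear_regression(partitions, num_features, step=0.1, iterations=1000):
    """Batch gradient descent where each partition contributes a partial gradient."""
    weights = [0.0] * num_features
    n = sum(len(part) for part in partitions)
    zero = [0.0] * num_features
    for _ in range(iterations):
        # "Map": each partition sums the gradient terms of its own records.
        partials = [
            reduce(add_vectors, (gradient_summand(weights, pt) for pt in part), zero)
            for part in partitions
        ]
        # "Reduce": combine the partial gradients into the full gradient.
        gradient = reduce(add_vectors, partials)
        weights = [w - step * g / n for w, g in zip(weights, gradient)]
    return weights


# Hypothetical toy data following y = 2x + 1; the first feature is a constant
# 1.0 so that the first weight acts as the intercept.
data = [([1.0, x], 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]
partitions = [data[:2], data[2:]]  # pretend the data lives on two workers

weights = fit_linear_regression(partitions, num_features=2)
print(weights)  # expected to be close to [1.0, 2.0] (intercept, slope)
```

Only the small weight vector and the summed gradients move between the "workers" and the driver loop; the records themselves stay in their partitions, which is what lets this style of algorithm scale to large datasets.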

What you'll learn

  • The underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines
  • Exploratory data analysis, feature extraction, supervised learning, and model evaluation
  • Application of these principles using Spark
  • How to implement distributed algorithms for fundamental statistical models

Meet your instructors

Ameet Talwalkar
Assistant Professor of Computer Science
University of California, Los Angeles

Jon Bates
Spark Instructor
Databricks

Pursue a Verified Certificate ($99.00) to highlight the knowledge and skills you gain

  • Official and Verified

    Receive an instructor-signed certificate with the institution's logo to verify your achievement and increase your job prospects

  • Easily Shareable

    Add the certificate to your CV or resume, or post it directly on LinkedIn

  • Proven Motivator

    Give yourself an additional incentive to complete the course

  • Support our Mission

    EdX, a non-profit, relies on verified certificates to help fund free education for everyone globally

Who can take this course?

Unfortunately, learners from one or more of the following countries or regions will not be able to register for this course: Iran, Cuba and the Crimea region of Ukraine. While edX has sought licenses from the U.S. Office of Foreign Assets Control (OFAC) to offer our courses to learners in these countries and regions, the licenses we have received are not broad enough to allow us to offer this course in all locations. EdX truly regrets that U.S. sanctions prevent us from offering all of our courses to everyone, no matter where they live.