In addition to providing more than 50 common machine learning algorithms, the Spark MLlib library provides abstractions for managing and simplifying many of the tasks in building machine learning models, such as featurization, pipelines for constructing, evaluating, and tuning models, and model persistence to help move models from development to production.

Starting with Spark 2.0, the MLlib APIs are based on DataFrames to take advantage of the user-friendliness and the many optimizations provided by the Catalyst and Tungsten components of the Spark SQL engine.

Machine learning algorithms are iterative in nature: they run through many iterations until a desired objective is achieved. Spark makes it extremely easy to implement these algorithms and run them in a scalable manner across a cluster of machines.
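To make the iterative nature concrete, here is a minimal plain-Python sketch (not Spark code): gradient descent fitting a slope `w` to toy data, where each iteration nudges `w` closer to the objective. Spark parallelizes exactly this kind of repeated pass over the data across a cluster.

```python
def fit_slope(xs, ys, lr=0.01, iterations=100):
    """Repeatedly nudge w to reduce the squared error of y = w * x,
    one pass over the data per iteration."""
    w = 0.0
    for _ in range(iterations):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

# Toy data generated from y = 3x; the loop converges toward w = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
print(round(fit_slope(xs, ys), 2))  # converges to 3.0
```

Each iteration improves the model slightly; the algorithm only stops when it has run enough iterations to reach the objective, which is why running such loops efficiently at scale matters.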

Commonly used classes of machine learning algorithms, such as classification, regression, clustering, and collaborative filtering, are available out of the box for data scientists and engineers to use.


One of the motivations behind the creation of Spark was to help applications run iterative algorithms efficiently at scale. Over the last few versions of Spark, the MLlib library has steadily increased its offerings to make ML scalable and easy to use by providing a set of commonly used ML algorithms and a set of tools to facilitate the process of building and evaluating ML models.

To appreciate the features that the MLlib library provides, it is necessary to have a fundamental understanding of the process of building ML applications.

ML is a vast and fascinating field of study that combines parts of other fields such as mathematics, statistics, and computer science. It is a method of teaching computers to learn patterns and derive insights from historical data, often for the purpose of making decisions or predictions.

Unlike traditional, hard-coded software, ML produces only probabilistic outputs, based on the imperfect data you provide: the more data you can feed to an ML algorithm, the more accurate its output tends to be. ML can solve much more interesting and difficult problems than traditional software can, and those problems are not specific to any one industry or business domain.

Examples of these relevant areas are image recognition, speech recognition, language translation, fraud detection, product recommendations, robotics, self-driving cars, speeding up the drug discovery process, medical diagnosis, customer churn prediction, and many more.

Given that the goal of AI is to make machines seem like they have intelligence, one of the best ways to measure that is by comparing machine intelligence against human intelligence.

There have been a few well-known and widely publicized demonstrations of such comparisons in recent decades. The first was a computer system called Deep Blue, which defeated the world chess champion in 1997 under strict tournament regulations. This example demonstrated that a computer can think faster and better than a human in a game with a vast but finite set of possible moves.

The second was a computer system called Watson, which competed on the Jeopardy! game show against two legendary champions in 2011 and won the first prize of $1 million. This example demonstrated that a computer can understand human language in a specific question-and-answer format and then tap into a vast knowledge base to come up with probabilistic answers.

The third was a computer program called AlphaGo, which defeated a world champion in the game of Go in a historic match in 2016. This example demonstrates a great leap forward for the AI field, because Go is a complex board game that requires intuition and creative, strategic thinking, and an exhaustive move search is infeasible: the number of possible moves is greater than the number of atoms in the universe.

There are many ML libraries to choose from. In the era of big data, there are two reasons to pick Spark MLlib over the alternatives. The first is ease of use: Spark SQL provides a user-friendly way of performing exploratory data analysis, and the MLlib library provides the means to build, manage, and persist complex ML pipelines. The second is the ability to perform ML at scale: the combination of the Spark unified data analytics engine and the MLlib library can support training machine learning models with billions of observations and thousands of features.

The ML process is essentially a pipeline: a series of steps that run sequentially and that usually need to be repeated several times to arrive at an optimal model.

In line with the goal of making practical machine learning easy, Spark MLlib provides a set of abstractions that simplify the steps of data cleaning, feature engineering, model training, model tuning, and model evaluation, and that organize them into a pipeline that is easy to understand, maintain, and repeat. The pipeline concept is inspired by the scikit-learn library mentioned earlier.
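The pipeline idea can be sketched in a few lines of plain Python. This is a conceptual toy, not the MLlib API: the stage names (`Scaler`, `ThresholdClassifier`) are made up for illustration. The point is that every stage exposes the same fit/transform shape, so the whole sequence can be assembled, rerun, and maintained as one unit.

```python
class Scaler:
    """Toy feature-engineering stage: rescale values into [0, 1]."""
    def fit(self, data):
        self.lo, self.hi = min(data), max(data)
        return self
    def transform(self, data):
        return [(x - self.lo) / (self.hi - self.lo) for x in data]

class ThresholdClassifier:
    """Toy model stage: label values above the mean as 1, others as 0."""
    def fit(self, data):
        self.cut = sum(data) / len(data)
        return self
    def transform(self, data):
        return [1 if x > self.cut else 0 for x in data]

class Pipeline:
    """Fit each stage in order, feeding its transformed output to the next."""
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

pipeline = Pipeline([Scaler(), ThresholdClassifier()])
print(pipeline.fit_transform([10.0, 20.0, 30.0, 40.0]))  # [0, 0, 1, 1]
```

MLlib's actual `Pipeline` follows the same pattern, but its stages operate on DataFrames and the fitted pipeline can be persisted and moved to production.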

Like the other components of the Spark unified data analytics engine, MLlib is switching to DataFrame-based APIs, both to be more user-friendly and to take advantage of the optimizations the Spark SQL engine provides. The new APIs live in the org.apache.spark.ml package.

The first version of MLlib was built on RDD-based APIs; those APIs are still supported but are in maintenance mode only, and they live in the org.apache.spark.mllib package. Once feature parity is reached, the RDD-based APIs will be deprecated.