Apache Spark

A distributed processing system for big data workloads. It supports batch processing, stream processing, and machine learning, and uses in-memory caching and optimized query execution to speed up work on large datasets.

by Matei Zaharia, UC Berkeley's AMPLab, Apache Software Foundation

4.1 / 5 · Free · Open source

Features & Limitations

+	Swift Processing	Achieves high data processing speed by reducing read-write to disk
+	Dynamic Nature	Supports building parallel applications with more than 80 high-level operators
+	In-Memory Computation	Speeds up processing by caching data in memory, so it is not re-read from disk on every access
+	Reusability	Allows the same code to be reused for batch processing, joining streams against historical data, and ad-hoc queries
+	Fault Tolerance	Uses the RDD abstraction to recover from worker node failures by recomputing lost partitions, reducing data loss
+	Real-Time Stream Processing	Processes streaming data in near real time, which batch-only Hadoop MapReduce cannot do
+	Lazy Evaluation	Transformations on RDDs are lazy: they run only when an action requests a result, which avoids unnecessary work
+	Multiple Language Support	Provides APIs in Java, Scala, Python, and R
+	DAG Execution Engine	Facilitates in-memory computation and acyclic data flow
+	SQL Interface	Offers a SQL:2003-compliant interface for querying data
+	Machine Learning Pipelines	Enables easy implementation of feature extraction and transformations
+	Graph Processing	Supports analysis techniques for data at scale
+	Structured Streaming	High-level API for working with unbounded streaming DataFrames and Datasets
+	Resilient Distributed Dataset (RDD)	Represents an immutable collection of objects across a cluster
+	Distributed Processing	Distributes data processing tasks across multiple computers
+	DataFrame Approach	Uses the DataFrame abstraction, borrowed from R and Python, to process structured data
+	Unified Analytics Engine	Large-scale data processing with built-in modules for SQL and ML
+	Lightweight Engine	Unified analytics engine for large-scale data processing
+	Runs Everywhere	Compatible with Hadoop, Mesos, Kubernetes, standalone, or cloud
+	Data Integration	Reduces cost and time for ETL processes
+	Interactive Analytics	Generates rapid responses for handling data interactively
+	Machine Learning at Scale	Stores data in memory for quick machine-learning algorithm processing
+	Cluster Management	Can run in stand-alone mode or with robust resource management systems
-	Limited Configurability and Optimization	Lacks automatic optimization and requires manual configuration for efficient resource usage. This can be complex for new users.
-	Data Management	Has no file management system of its own, so it relies on external storage, and it handles large numbers of small files inefficiently.
-	Limited Real-time Processing	While powerful for batch processing, it handles streams in micro-batches by default, which limits its use in truly real-time data pipelines.
-	High Resource Requirements	In-memory processing can be memory-intensive, leading to high hardware costs and potential latency issues.
-	Integration Complexity	Integrating with other systems can be challenging, requiring careful planning.
-	Scalability with Effort	While scalable, it requires proper resource allocation and management for optimal performance.
-	Fewer Algorithms	Compared to other platforms, it offers a limited set of machine learning algorithms.
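
The Lazy Evaluation, In-Memory Computation, and Fault Tolerance rows above can be illustrated with a small pure-Python sketch. This is a toy model, not Spark's API: the class name ToyRDD and its lose_cache method are hypothetical, invented only to show how recorded transformations (the lineage) run when an action is called and can be replayed after a cached result is lost.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class)."""

    def __init__(self, source: Callable[[], Iterable], steps: tuple = ()):
        self._source = source                  # how to (re)read the input data
        self._steps = steps                    # lineage: ordered transformations
        self._cache: Optional[List] = None     # in-memory result, if computed

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: records the step and returns a new object; no work yet.
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "ToyRDD":
        # Transformation: also lazy, also returns a new immutable recipe.
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def collect(self) -> List:
        # Action: only now is the recorded lineage executed.
        if self._cache is not None:
            return list(self._cache)           # served from memory
        data = iter(self._source())
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        self._cache = list(data)
        return list(self._cache)

    def lose_cache(self) -> None:
        # Simulates a worker failure that drops the in-memory result.
        self._cache = None


if __name__ == "__main__":
    squares = ToyRDD(lambda: range(1, 6)).map(lambda x: x * x)
    odd_squares = squares.filter(lambda v: v % 2 == 1)
    print(odd_squares.collect())   # computed on first action
    odd_squares.lose_cache()
    print(odd_squares.collect())   # recomputed from the lineage
```

In real Spark the lineage is tracked per partition across a cluster, so only the lost partitions are recomputed; the sketch collapses that to a single in-process list.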

System Requirements

#	Resource	Minimum
1	CPU	8-16 cores per machine
2	Memory	8 GB to hundreds of GB per machine
3	Storage	4-8 disks per node
4	Network	10 Gigabit or higher
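
As a rough aid for reading the memory range in the table above, the sketch below computes how much of a machine's memory might be handed to Spark. It assumes the guideline from Spark's hardware provisioning documentation of giving Spark at most about 75% of a machine's memory and leaving the rest for the operating system and buffer cache; the function name and the default fraction are illustrative assumptions, not Spark settings.

```python
def spark_memory_budget_gb(machine_memory_gb: float, spark_fraction: float = 0.75) -> float:
    """Memory (GB) to give Spark, leaving the remainder for the OS and buffer cache.

    spark_fraction is an assumed rule-of-thumb value, not a Spark configuration key.
    """
    if machine_memory_gb <= 0:
        raise ValueError("machine_memory_gb must be positive")
    if not 0 < spark_fraction <= 1:
        raise ValueError("spark_fraction must be in (0, 1]")
    return machine_memory_gb * spark_fraction


if __name__ == "__main__":
    for memory_gb in (8, 64, 256):
        budget = spark_memory_budget_gb(memory_gb)
        print(f"{memory_gb} GB machine: about {budget:g} GB for Spark")
```

The result is only a starting point; actual executor memory settings depend on the workload and the cluster manager in use.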

Ratings

4.08

PAT RESEARCH	7.7/10	based on professional opinion
PAT RESEARCH	8.2/10	based on 2 reviews
TrustRadius	8.6/10	based on 101 reviews

Developer

Matei Zaharia, UC Berkeley's AMPLab, Apache Software Foundation

Written in

Scala, Java, Python, R

Initial Release

26 May 2014

Repository

https://github.com/apache/spark

License

Apache License 2.0

Alternatives

Data Analytics
KNIME Analytics Platform
Machine Learning
Apache Mahout, Massive Online Analysis, TensorFlow, Apache MXNet, Apache SystemDS, Eclipse Deeplearning4j, MALLET, mlpack, OpenCV, Orange, PyTorch, scikit-learn, The Microsoft Cognitive Toolkit, Torch, Weka, Yooreeka
Cloud Computing
Apache Mahout, Eureka, Kubecost, Pulumi IaC, Infracost, Terraform by HashiCorp, Velero, Apache Hadoop

Notes

Libraries:

Spark SQL is Apache Spark’s module for working with structured data.
Spark Connect is a protocol that specifies how a client application can communicate with a remote Spark Server.
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
MLlib is Apache Spark’s scalable machine learning library.
GraphX is Apache Spark’s API for graphs and graph-parallel computation.