Apache Spark
A distributed processing system for big data workloads. It supports batch processing, stream processing, and machine learning, and uses in-memory caching and optimized query execution to speed up work on large datasets.
| +/- | Feature | Description |
|---|---|---|
| + | Swift Processing  | Achieves high data processing speed by reducing disk reads and writes | 
| + | Dynamic Nature  | Supports the development of parallel applications with over 80 high-level operators | 
| + | In-Memory Computation  | Increases processing speed by caching data, avoiding disk fetch each time | 
| + | Reusability  | Allows code reuse for batch-processing, stream joining, and ad-hoc queries | 
| + | Fault Tolerance  | Utilizes RDD abstraction to handle worker node failures, reducing data loss | 
| + | Real-Time Stream Processing  | Handles real-time data, overcoming Hadoop MapReduce limitations | 
| + | Lazy Evaluation  | Transformations on Spark RDDs are lazy: they run only when an action needs a result, which avoids unnecessary work | 
| + | Multiple Language Support  | Provides APIs in Java, Scala, Python, and R | 
| + | DAG Execution Engine  | Facilitates in-memory computation and acyclic data flow | 
| + | SQL Interface  | Offers a SQL:2003-compliant interface for querying data | 
| + | Machine Learning Pipelines  | Enables easy implementation of feature extraction and transformations | 
| + | Graph Processing  | Supports graph analysis and graph-parallel computation at scale through GraphX | 
| + | Structured Streaming  | High-level API for creating infinite streaming dataframes and datasets | 
| + | Resilient Distributed Dataset (RDD)  | Represents an immutable collection of objects across a cluster | 
| + | Distributed Processing  | Distributes data processing tasks across multiple computers | 
| + | DataFrame Approach  | Borrows the DataFrame concept from R and Python for processing structured data | 
| + | Unified Analytics Engine  | Large-scale data processing with built-in modules for SQL and ML | 
| + | Lightweight Engine  | Unified analytics engine for large-scale data processing | 
| + | Runs Everywhere  | Compatible with Hadoop, Mesos, Kubernetes, standalone, or cloud | 
| + | Data Integration  | Reduces cost and time for ETL processes | 
| + | Interactive Analytics  | Generates rapid responses for handling data interactively | 
| + | Machine Learning at Scale  | Keeps data in memory so that iterative machine-learning algorithms run quickly | 
| + | Cluster Management  | Can run in stand-alone mode or with robust resource management systems | 
| - | Limited Configurability and Optimization  | Lacks automatic optimization and requires manual configuration for efficient resource usage. This can be complex for new users. | 
| - | Data Management  | Has no file management system of its own and handles large numbers of small files poorly, which hurts efficiency. | 
| - | Limited Real-time Processing  | While powerful for batch processing, its streaming runs in micro-batches by default, which limits it for truly real-time data pipelines. | 
| - | High Resource Requirements  | In-memory processing can be memory-intensive, leading to high hardware costs and potential latency issues. | 
| - | Integration Complexity  | Integrating with other systems can be challenging, requiring careful planning. | 
| - | Scalability with Effort  | While scalable, it requires proper resource allocation and management for optimal performance. | 
| - | Fewer Algorithms  | Compared to other platforms, it offers a limited set of machine learning algorithms. | 
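
The Lazy Evaluation, In-Memory Computation, and Fault Tolerance rows above can be illustrated with a toy model. The sketch below is plain Python and does not use Spark's API: `ToyRDD`, `evict`, and `load` are hypothetical names introduced only for illustration. It assumes a single machine and recomputes the whole dataset from its lineage, whereas Spark itself works on partitions spread across a cluster.

```python
from typing import Callable, Iterable, List, Optional, Tuple


class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, lazy, rebuilt from lineage."""

    def __init__(self, source: Callable[[], Iterable], lineage: Tuple = ()):
        self._source = source            # how to (re)read the base data
        self._lineage = tuple(lineage)   # recorded transformations, not yet run
        self._persist = False            # whether to keep results in memory
        self._cached: Optional[List] = None

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: returns a new ToyRDD and computes nothing.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred: Callable) -> "ToyRDD":
        # Transformation: also lazy, only extends the lineage.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def cache(self) -> "ToyRDD":
        # Ask for the result to be kept in memory after the first action.
        self._persist = True
        return self

    def evict(self) -> None:
        # Simulate losing the in-memory copy, for example a worker failure.
        self._cached = None

    def collect(self) -> List:
        # Action: this is the only place where work actually happens.
        if self._cached is not None:
            return list(self._cached)
        data = list(self._source())
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._persist:
            self._cached = data
        return list(data)


reads = []  # one entry is appended every time the base data is read


def load():
    reads.append("read")
    return range(1, 6)


squares_of_evens = (
    ToyRDD(load).filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
)
reads_before_action = len(reads)      # transformations alone read nothing

first = squares_of_evens.collect()    # action: reads the source once
reads_after_first = len(reads)
second = squares_of_evens.collect()   # served from the in-memory copy
reads_after_second = len(reads)

squares_of_evens.evict()              # simulated failure drops the cache
third = squares_of_evens.collect()    # result is rebuilt from the lineage
reads_after_third = len(reads)
```

The design point is that a transformation only records what to do, so the source is read once per action, and a lost in-memory copy costs a recomputation rather than the data.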
System Requirements
| # | Resource | Minimum |
|---|---|---|
| 1 | CPU | 8-16 cores per machine |
| 2 | Memory | 8 GB to hundreds of GBs per machine |
| 3 | Storage | 4-8 disks per node |
| 4 | Network | 10 Gigabit or higher network |
Ratings
4.08/5
| Source | Rating |
|---|---|
| PAT RESEARCH | 7.7/10, based on professional opinion |
| PAT RESEARCH | 8.2/10, based on 2 reviews |
| TrustRadius | 8.6/10, based on 101 reviews |
Written in
Scala, Java, Python, R
Initial Release
26 May 2014
Repository
License
Categories
Alternatives
Data Analytics
Machine Learning
Cloud Computing
Notes
Libraries:
- Spark SQL is Apache Spark’s module for working with structured data.
- Spark Connect is a protocol that specifies how a client application can communicate with a remote Spark Server.
- Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
- MLlib is Apache Spark’s scalable machine learning library.
- GraphX is Apache Spark’s API for graphs and graph-parallel computation.