
Apache Spark

An open-source distributed processing system for big data workloads. It supports batch processing, stream processing, and machine learning, and uses in-memory caching and optimized query execution to run fast queries against large datasets.


+
Swift Processing
Achieves high processing speed by reducing the number of disk reads and writes
+
Dynamic Nature
Supports the development of parallel applications with more than 80 high-level operators
+
In-Memory Computation
Increases processing speed by caching data in memory, so it is not fetched from disk on every access
+
Reusability
Allows the same code to be reused for batch processing, joining streams against historical data, and ad-hoc queries
+
Fault Tolerance
Uses the RDD abstraction to recover from worker-node failures by recomputing lost partitions from their lineage, which minimizes data loss
+
Real-Time Stream Processing
Processes real-time data streams, overcoming the batch-only limitation of Hadoop MapReduce
+
Lazy Evaluation
Transformations on Spark RDDs are lazy: they are computed only when an action requests a result, which avoids unnecessary work
+
Multiple Language Support
Provides APIs in Java, Scala, Python, and R
+
DAG Execution Engine
Facilitates in-memory computation and acyclic data flow
+
SQL Interface
Offers a SQL:2003-compliant interface for querying data
+
Machine Learning Pipelines
Enables easy implementation of feature extraction and transformations
+
Graph Processing
Supports graph analysis techniques on data at scale
+
Structured Streaming
High-level API for creating unbounded streaming DataFrames and Datasets
+
Resilient Distributed Dataset (RDD)
Represents an immutable collection of objects partitioned across a cluster
+
Distributed Processing
Distributes data processing tasks across multiple computers
+
DataFrame Approach
Borrows the DataFrame concept from R and Python for processing structured data
+
Unified Analytics Engine
Large-scale data processing with built-in modules for SQL and ML
+
Lightweight Engine
Unified analytics engine for large-scale data processing
+
Runs Everywhere
Runs on Hadoop, Mesos, Kubernetes, in standalone mode, or in the cloud
+
Data Integration
Reduces cost and time for ETL processes
+
Interactive Analytics
Generates rapid responses for handling data interactively
+
Machine Learning at Scale
Keeps data in memory so that iterative machine-learning algorithms run quickly
+
Cluster Management
Can run in stand-alone mode or with robust resource management systems
-
Limited Configurability and Optimization
Lacks automatic optimization and requires manual configuration for efficient resource usage. This can be complex for new users.
-
Data Management
Has no file management system of its own, so it depends on external storage, and it handles large numbers of small files inefficiently.
-
Limited Real-time Processing
While powerful for batch processing, it has limitations for truly real-time data pipelines.
-
High Resource Requirements
In-memory processing can be memory-intensive, leading to high hardware costs and potential latency issues.
-
Integration Complexity
Integrating with other systems can be challenging, requiring careful planning.
-
Scalability with Effort
While scalable, it requires proper resource allocation and management for optimal performance.
-
Fewer Algorithms
Compared to other platforms, it offers a limited set of machine learning algorithms.
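
Three of the strengths listed above (lazy evaluation, in-memory caching, and RDD-based fault tolerance) can be illustrated with a toy model. The sketch below is plain Python and does not use Spark; `ToyRDD`, `from_items`, `evict`, and `compute_count` are hypothetical names invented for illustration, and everything runs in a single process. It only aims to show the general idea: transformations record lineage without doing work, an action triggers the computation, a cached result is reused, and a lost cache is rebuilt by replaying the lineage.

```python
from typing import Callable, Iterable, List, Optional, Tuple


class ToyRDD:
    """Hypothetical, single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute: Callable[[], List], lineage: Tuple[str, ...]):
        self._compute = compute          # deferred recipe that produces the data
        self.lineage = lineage           # names of the steps that lead here
        self._cache: Optional[List] = None
        self._cache_enabled = False
        self.compute_count = 0           # how many times the recipe actually ran

    @classmethod
    def from_items(cls, items: Iterable) -> "ToyRDD":
        frozen = tuple(items)            # immutable source data
        return cls(lambda: list(frozen), ("source",))

    # Transformations: build a new ToyRDD, run nothing yet.
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(lambda: [fn(x) for x in self._materialize()],
                      self.lineage + ("map",))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)],
                      self.lineage + ("filter",))

    def cache(self) -> "ToyRDD":
        self._cache_enabled = True       # keep the result after the first action
        return self

    def evict(self) -> None:
        """Simulate losing the in-memory copy, e.g. a failed worker."""
        self._cache = None

    def _materialize(self) -> List:
        if self._cache is not None:
            return self._cache
        self.compute_count += 1
        result = self._compute()
        if self._cache_enabled:
            self._cache = result
        return result

    # Action: forces the computation and returns a copy of the result.
    def collect(self) -> List:
        return list(self._materialize())


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)                      # record every real invocation
    return x * x


numbers = ToyRDD.from_items(range(1, 6))
evens = numbers.map(square).filter(lambda x: x % 2 == 0).cache()
calls_before_action = len(calls)         # transformations alone ran nothing

first = evens.collect()                  # action: computes and caches
calls_after_first = len(calls)
second = evens.collect()                 # served from the cache
calls_after_second = len(calls)
evens.evict()                            # cached copy is lost
third = evens.collect()                  # rebuilt by replaying the lineage
calls_after_third = len(calls)

print(first, calls_before_action, calls_after_first,
      calls_after_second, calls_after_third)
# prints: [4, 16] 0 5 5 10
```

In this model `square` runs five times for the first action, zero times for the second, and five more times after the simulated loss. Spark applies the same idea per partition across a cluster, which this sketch does not attempt to reproduce.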

Platform

Desktop
Language
SQL, Scala, R, Python, Java

System Requirements

Minimum:

  1. 8-16 cores per machine
  2. 8 GB to hundreds of GBs of memory per machine
  3. 4-8 disks per node
  4. 10 Gigabit or higher network

Ratings

4.08 / 5

PAT RESEARCH: 7.7 / 10 (based on professional opinion)
PAT RESEARCH: 8.2 / 10 (based on 2 reviews)
TrustRadius: 8.6 / 10 (based on 101 reviews)

Developer

Written in

Scala, Java, Python, R

Initial Release

26 May 2014

Repository

License

Categories


Notes

Libraries:

  1. Spark SQL is Apache Spark’s module for working with structured data.
  2. Spark Connect is a protocol that specifies how a client application can communicate with a remote Spark Server.
  3. Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
  4. MLlib is Apache Spark’s scalable machine learning library.
  5. GraphX is Apache Spark’s API for graphs and graph-parallel computation.