Apache Spark
A distributed processing system for big data workloads. It supports batch processing, stream processing, and machine learning, and uses in-memory caching and optimized query execution to work efficiently on large datasets.
+/- | Feature | Description |
---|---|---|
+ | Swift Processing | Achieves high processing speed by reducing disk reads and writes |
+ | Dynamic Nature | Supports the development of parallel applications with more than 80 high-level operators |
+ | In-Memory Computation | Increases processing speed by caching data in memory, avoiding a disk fetch on every access |
+ | Reusability | Allows code reuse for batch-processing, stream joining, and ad-hoc queries |
+ | Fault Tolerance | Uses the RDD abstraction to recover from worker node failures by recomputing lost partitions from their lineage, reducing data loss |
+ | Real-Time Stream Processing | Handles real-time data, overcoming Hadoop MapReduce limitations |
+ | Lazy Evaluation | Transformations on Spark RDDs are lazy: nothing is computed until an action requests a result, which avoids unnecessary work |
+ | Multiple Language Support | Provides APIs in Java, Scala, Python, and R |
+ | DAG Execution Engine | Facilitates in-memory computation and acyclic data flow |
+ | SQL Interface | Offers a SQL:2003-compliant interface for querying data |
+ | Machine Learning Pipelines | Enables easy implementation of feature extraction and transformations |
+ | Graph Processing | Supports graph analysis and graph-parallel computation at scale through GraphX |
+ | Structured Streaming | High-level API for creating unbounded streaming DataFrames and Datasets |
+ | Resilient Distributed Dataset (RDD) | Represents an immutable collection of objects partitioned across a cluster |
+ | Distributed Processing | Distributes data processing tasks across multiple computers |
+ | DataFrame Approach | Borrows the DataFrame concept from R and Python for processing structured data |
+ | Unified Analytics Engine | Large-scale data processing with built-in modules for SQL and ML |
+ | Lightweight Engine | Unified analytics engine for large-scale data processing |
+ | Runs Everywhere | Compatible with Hadoop, Mesos, Kubernetes, standalone, or cloud |
+ | Data Integration | Reduces cost and time for ETL processes |
+ | Interactive Analytics | Generates rapid responses for handling data interactively |
+ | Machine Learning at Scale | Stores data in memory for quick machine-learning algorithm processing |
+ | Cluster Management | Can run in stand-alone mode or with robust resource management systems |
- | Limited Configurability and Optimization | Lacks automatic optimization and requires manual configuration for efficient resource usage. This can be complex for new users. |
- | Data Management | Doesn’t have its own file management system and struggles with small files, impacting efficiency. |
- | Limited Real-time Processing | While powerful for batch processing, it has limitations for truly real-time data pipelines. |
- | High Resource Requirements | In-memory processing can be memory-intensive, leading to high hardware costs and potential latency issues. |
- | Integration Complexity | Integrating with other systems can be challenging, requiring careful planning. |
- | Scalability with Effort | While scalable, it requires proper resource allocation and management for optimal performance. |
- | Fewer Algorithms | Compared to other platforms, it offers a limited set of machine learning algorithms. |
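The Lazy Evaluation, In-Memory Computation, and Fault Tolerance rows above can be illustrated with a toy model. The sketch below is plain Python, not the PySpark API: `ToyRDD`, its `evict` method, and the `square` helper are hypothetical names invented for this illustration, and everything runs in a single process. It only shows the idea that transformations record a recipe, an action triggers the work, a cached result is reused, and a lost cache can be rebuilt from lineage.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; NOT the pyspark.RDD API)."""

    def __init__(self, source: Callable[[], Iterable], lineage: List[str]):
        self._source = source            # recipe that recomputes the data on demand
        self.lineage = lineage           # record of the steps that produced this dataset
        self._cache: Optional[list] = None
        self._cache_enabled = False

    @classmethod
    def parallelize(cls, data: Iterable) -> "ToyRDD":
        items = list(data)
        return cls(lambda: iter(items), ["parallelize"])

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: returns a new ToyRDD and computes nothing yet.
        return ToyRDD(lambda: (fn(x) for x in self._compute()),
                      self.lineage + ["map"])

    def filter(self, pred: Callable) -> "ToyRDD":
        # Transformation: also lazy.
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)),
                      self.lineage + ["filter"])

    def cache(self) -> "ToyRDD":
        # Ask for the result to be kept in memory once it is first computed.
        self._cache_enabled = True
        return self

    def evict(self) -> None:
        # Simulate losing the in-memory copy (for example, a failed worker).
        self._cache = None

    def _compute(self):
        if self._cache is not None:
            return iter(self._cache)     # served from memory
        if self._cache_enabled:
            self._cache = list(self._source())
            return iter(self._cache)
        return self._source()            # recomputed from the recipe

    def collect(self) -> list:
        # Action: this is what actually triggers the computation.
        return list(self._compute())


calls = []                               # records every time square() really runs


def square(x):
    calls.append(x)
    return x * x


squares = ToyRDD.parallelize(range(5)).map(square).cache()
evens = squares.filter(lambda x: x % 2 == 0)
ran_before_action = len(calls)           # still 0: only recipes exist so far

first = evens.collect()                  # action: square() runs for each element
second = squares.collect()               # reuses the cached squares
runs_after_cache_hit = len(calls)

squares.evict()                          # cached copy is lost
third = evens.collect()                  # rebuilt by replaying the lineage
```

In this toy model, `square` runs five times for the first action, zero additional times while the cache is available, and five more times after the eviction. Real Spark applies the same idea per partition across a cluster rather than to a whole in-process list.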
System Requirements
# | Component | Minimum |
---|---|---|
1 | CPU | 8-16 cores per machine |
2 | Memory | 8 GB to hundreds of GB |
3 | Storage | 4-8 disks per node |
4 | Network | 10 Gigabit or higher |
Written in
Scala, Java, Python, R
Initial Release
26 May 2014
Alternatives
Data Analytics
KNIME Analytics Platform
Machine Learning
Massive Online Analysis, TensorFlow, Apache Mahout, Apache MXNet, Apache SystemDS, Eclipse Deeplearning4j, MALLET, mlpack, OpenCV, Orange, PyTorch, scikit-learn, The Microsoft Cognitive Toolkit, Torch, Weka, Yooreeka
Cloud Computing
Pulumi IaC, Infracost, Terraform, Velero, Apache Hadoop, Apache Mahout
Notes
Libraries:
- Spark SQL is Apache Spark’s module for working with structured data.
- Spark Connect is a protocol that specifies how a client application can communicate with a remote Spark Server.
- Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
- MLlib is Apache Spark’s scalable machine learning library.
- GraphX is Apache Spark’s API for graphs and graph-parallel computation.