
Apache Spark

A distributed processing system for big data workloads that supports batch processing, stream processing, and machine learning, with in-memory caching and optimized query execution for large datasets.


+ Swift Processing: achieves high data-processing speed by reducing disk reads and writes.
+ Dynamic Nature: supports development of parallel applications with over 80 high-level operators.
+ In-Memory Computation: increases processing speed by caching data, avoiding a disk fetch each time.
+ Reusability: allows code reuse across batch processing, stream joining, and ad-hoc queries.
+ Fault Tolerance: uses the RDD abstraction to handle worker-node failures, reducing data loss.
+ Real-Time Stream Processing: handles real-time data, overcoming Hadoop MapReduce limitations.
+ Lazy Evaluation: transformations on Spark RDDs are lazy, increasing system efficiency.
+ Multiple Language Support: provides APIs in Java, Scala, Python, and R.
+ DAG Execution Engine: facilitates in-memory computation and acyclic data flow.
+ SQL Interface: offers a SQL:2003-compliant interface for querying data.
+ Machine Learning Pipelines: enables easy implementation of feature extraction and transformations.
+ Graph Processing: supports graph-analysis techniques for data at scale.
+ Structured Streaming: high-level API for creating unbounded streaming DataFrames and Datasets.
+ Resilient Distributed Dataset (RDD): an immutable collection of objects partitioned across a cluster.
+ Distributed Processing: distributes data-processing tasks across multiple computers.
+ DataFrame Approach: borrowed from R and Python for processing structured data.
+ Unified Analytics Engine: a lightweight engine for large-scale data processing with built-in modules for SQL and machine learning.
+ Runs Everywhere: compatible with Hadoop, Mesos, Kubernetes, standalone mode, or the cloud.
+ Data Integration: reduces the cost and time of ETL processes.
+ Interactive Analytics: generates rapid responses for interactive data exploration.
+ Machine Learning at Scale: stores data in memory for fast machine-learning algorithm processing.
+ Cluster Management: runs in standalone mode or with robust resource-management systems.
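The lazy-evaluation point above can be illustrated without a cluster: the sketch below mimics Spark's split between transformations (which only build a pipeline) and actions (which trigger execution) using plain Python generators. The names `source`, `map_t`, and `filter_t` are hypothetical stand-ins, not Spark APIs.

```python
# Conceptual sketch of Spark-style lazy evaluation using plain Python
# generators: transformations build a pipeline, nothing executes until
# an "action" consumes it.
log = []

def source(data):
    # Stand-in for reading input partitions; records each read.
    for x in data:
        log.append(("read", x))
        yield x

def map_t(it, f):
    # Lazy "map" transformation.
    return (f(x) for x in it)

def filter_t(it, p):
    # Lazy "filter" transformation.
    return (x for x in it if p(x))

pipeline = filter_t(map_t(source(range(5)), lambda x: x * 2),
                    lambda x: x > 4)
assert log == []          # lazy: no input has been read yet
result = list(pipeline)   # the "action" triggers the whole chain
assert result == [6, 8]
```

In Spark itself, the same structure appears as `rdd.map(...).filter(...)` (lazy) followed by an action such as `collect()`.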
- Limited Configurability and Optimization: lacks automatic optimization and requires manual configuration for efficient resource usage, which can be complex for new users.
- Data Management: has no file-management system of its own and struggles with many small files, hurting efficiency.
- Limited Real-Time Processing: powerful for batch and micro-batch processing, but limited for truly real-time data pipelines.
- High Resource Requirements: in-memory processing is memory-intensive, leading to high hardware costs and potential latency issues.
- Integration Complexity: integrating with other systems can be challenging and requires careful planning.
- Scalability with Effort: scalable, but requires proper resource allocation and management for optimal performance.
- Fewer Algorithms: offers a more limited set of machine-learning algorithms than some dedicated platforms.
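The lineage-based fault tolerance credited to RDDs above (recomputing a lost partition from its recorded transformations, rather than replicating data) can be sketched in plain Python. The `RDD` and `SourceRDD` classes here are simplified, hypothetical stand-ins for Spark's internals, not its actual API.

```python
# Conceptual sketch of RDD lineage-based fault tolerance: each RDD
# records its parent and the transformation that produced it, so any
# partition can be rebuilt on demand instead of being replicated.
class SourceRDD:
    def __init__(self, partitions):
        self.partitions = partitions  # partition id -> data

    def compute(self, partition):
        return self.partitions[partition]

class RDD:
    def __init__(self, parent, transform):
        self.parent = parent          # lineage: where the data came from
        self.transform = transform    # lineage: how it was derived

    def compute(self, partition):
        # Recompute from lineage: pull parent data, reapply transform.
        return [self.transform(x) for x in self.parent.compute(partition)]

base = SourceRDD({0: [1, 2], 1: [3, 4]})
doubled = RDD(base, lambda x: x * 2)

# If the worker holding partition 1 fails, the partition is rebuilt
# from lineage rather than restored from a replica:
assert doubled.compute(1) == [6, 8]
```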

System Requirements

Minimum:

  1. CPU: 8-16 cores per machine
  2. Memory: 8 GB to hundreds of GBs
  3. Storage: 4-8 disks per node
  4. Network: 10 Gigabit or higher

Ratings

  - 4.08 / 5
  - PAT RESEARCH: 7.7 / 10 (based on professionals' opinions)
  - PAT RESEARCH: 8.2 / 10 (based on 2 reviews)
  - TrustRadius: 8.6 / 10 (based on 101 reviews)

Written in

Scala, Java, Python, R

Initial Release

26 May 2014


Notes

Libraries:

  1. Spark SQL is Apache Spark’s module for working with structured data.
  2. Spark Connect is a protocol that specifies how a client application can communicate with a remote Spark Server.
  3. Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
  4. MLlib is Apache Spark’s scalable machine learning library.
  5. GraphX is Apache Spark’s API for graphs and graph-parallel computation.
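The RDD operator chain underlying these libraries (flatMap, map, reduceByKey) can be sketched cluster-free with plain Python equivalents; this is the classic word-count pattern, shown here without `pyspark` so it needs no Spark installation.

```python
# Cluster-free sketch of Spark's word-count operator chain using plain
# Python stand-ins for flatMap / map / reduceByKey.
from collections import defaultdict

lines = ["spark is fast", "spark is lazy"]

words = (w for line in lines for w in line.split())  # flatMap
pairs = ((w, 1) for w in words)                      # map to (key, 1)

counts = defaultdict(int)
for w, n in pairs:                                   # reduceByKey(add)
    counts[w] += n

assert dict(counts) == {"spark": 2, "is": 2, "fast": 1, "lazy": 1}
```

In Spark the same logic reads `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`, with the reduction executed in parallel across partitions.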