
Apache Spark

Distributed general-purpose cluster-computing framework


Overview

Apache Spark(TM) is an open-source, distributed, general-purpose cluster-computing framework with a (mostly) in-memory data-processing engine. It can perform ETL, analytics, machine learning, and graph processing on large volumes of data, both at rest (batch processing) and in motion (stream processing), and offers rich, concise, high-level APIs in Scala, Python, Java, R, and SQL.
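
To give a feel for the shape of those high-level APIs, here is a toy word count in plain Python that mirrors Spark's classic flatMap / map / reduceByKey pipeline. This is purely illustrative; real Spark code would go through a SparkContext or SparkSession, which is not assumed here.

```python
# Toy word count mirroring the shape of Spark's RDD API
# (flatMap -> map -> reduceByKey), in plain Python.
from collections import Counter
from itertools import chain

lines = ["spark is fast", "spark is general purpose"]

words = chain.from_iterable(line.split() for line in lines)  # flatMap
pairs = ((w, 1) for w in words)                              # map
counts = Counter()                                           # reduceByKey
for word, n in pairs:
    counts[word] += n

print(counts["spark"])  # 2
```

In actual Spark the same pipeline is a one-liner over an RDD or DataFrame, and the reduce step runs in parallel across the cluster.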

In contrast to Hadoop’s two-stage disk-based MapReduce computation engine, Spark’s multi-stage (mostly) in-memory computing engine allows for running most computations in memory, and hence most of the time provides better performance for certain applications, e.g. iterative algorithms or interactive data mining.
- Mastering Apache Spark by Jacek Laskowski
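
The benefit the quote describes can be sketched in plain Python (no Spark assumed): a disk-based engine like classic MapReduce re-reads its input on every pass, while an in-memory engine pays the load cost once and then iterates over the cached working set.

```python
# Conceptual sketch: why caching the working set in memory helps
# iterative algorithms. load_dataset() stands in for an expensive
# read from disk; the counter tracks how often it runs.

load_count = 0

def load_dataset():
    """Stand-in for reading a dataset from disk (expensive)."""
    global load_count
    load_count += 1
    return list(range(1_000))

# Disk-style engine: reload the data on every iteration.
for _ in range(5):
    data = load_dataset()
    total = sum(data)
disk_loads = load_count

# Spark-style engine: "cache" once, then iterate in memory
# (analogous to calling rdd.cache() before an iterative job).
load_count = 0
cached = load_dataset()
for _ in range(5):
    total = sum(cached)
memory_loads = load_count

print(disk_loads, memory_loads)  # 5 1
```

Five iterations cost five disk reads in the first style but only one in the second, which is why iterative workloads such as machine-learning training loops gain the most from Spark's model.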

See Apache Hadoop.

Libraries:

  1. Spark SQL is Apache Spark's module for working with structured data.
  2. Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
  3. MLlib is Apache Spark's scalable machine learning library.
  4. GraphX is Apache Spark's API for graphs and graph-parallel computation.
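
As a rough illustration of what Spark SQL expresses, the following plain-Python snippet runs the equivalent of a small SQL query over structured records; no Spark installation is assumed, and the data is invented for the example.

```python
# Conceptual sketch of a Spark SQL-style query over structured data,
# written against plain Python dicts (example data, no Spark assumed).
rows = [
    {"name": "alice", "age": 34},
    {"name": "bob", "age": 19},
]

# Equivalent in spirit to: SELECT name FROM people WHERE age > 21
adults = [r["name"] for r in rows if r["age"] > 21]

print(adults)  # ['alice']
```

In Spark SQL the same query could be written either as the SQL string above or with the DataFrame API, and would execute distributed across the cluster.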

News | Stack Overflow Q&A | Community/Mailing Lists | Documentation | FAQ | IRC

System Requirements

Minimum:

  1. Hard disk: 4-8 disks per node
  2. RAM: 8 GB to hundreds of GB
  3. Network: 10 Gigabit Ethernet or faster
  4. Processor: 8-16 cores per machine

Ratings

4.08 / 5

TrustRadius: 8.6 / 10, based on 101 reviews

PAT RESEARCH: 7.7 / 10, based on professionals' opinion

PAT RESEARCH: 8.2 / 10, based on 2 reviews

Developer

Matei Zaharia at UC Berkeley's AMPLab; Apache Software Foundation

Written in

Scala, Java, Python, R

Initial Release

26 May 2014

License

Apache License 2.0