Apache Spark adds a much-needed spark to Big Data processing

Apache Spark, once the new kid on the block, is now touted as the next big thing in Big Data. It is the largest open source project in data processing, and its features make it fast, easy to use, and a unified engine. Since its inception, Spark has been adopted on a massive scale at big companies such as Yahoo, Amazon, eBay, and Groupon. In a short span of time it has become the largest open source community in Big Data, with over 750 contributors from 200+ organizations.

Spark is a framework for parallel, distributed data processing. It offers a simple programming abstraction with powerful caching and persistence capabilities. It can be deployed on Apache Mesos, on Apache Hadoop via YARN, or with Spark’s own cluster manager. It also serves as a foundation for additional data processing frameworks such as Shark, which provides SQL functionality for Hadoop.

Spark is an excellent tool for iterative processing of large datasets. One reason is its Resilient Distributed Dataset (RDD). By using RDDs, programmers can pin large datasets in memory, which supports high-performance iterative processing. Compared to reading a large dataset from disk on every iteration, the in-memory approach is much faster. Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. It does so by reducing the number of read/write operations to disk.
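
To make the cost of repeated disk reads concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the ToyDataset class, its cache() method, and the iterate_mean function are hypothetical names invented for this illustration. It only counts how often an iterative job has to rescan its input with and without an in-memory copy.

```python
# Toy model in plain Python (not Spark's API): counts how often the
# "disk" is scanned with and without keeping the dataset in memory.


class ToyDataset:
    """Hypothetical stand-in for a dataset stored on disk."""

    def __init__(self, records):
        self._on_disk = list(records)   # pretend this lives on disk
        self._in_memory = None          # filled after the first cached read
        self._cache_requested = False
        self.disk_reads = 0             # number of full scans from "disk"

    def cache(self):
        """Ask for the data to be kept in memory after the first read."""
        self._cache_requested = True
        return self

    def records(self):
        if self._in_memory is not None:
            return self._in_memory      # served from memory, no disk scan
        self.disk_reads += 1            # simulate an expensive disk scan
        data = list(self._on_disk)
        if self._cache_requested:
            self._in_memory = data
        return data


def iterate_mean(dataset, iterations):
    """A simple iterative job: nudge an estimate toward the mean."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.records()        # every iteration needs the full data
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= 0.5 * gradient
    return estimate


uncached = ToyDataset(range(1, 101))
cached = ToyDataset(range(1, 101)).cache()

iterate_mean(uncached, iterations=10)
iterate_mean(cached, iterations=10)

print(uncached.disk_reads)  # 10: one scan per iteration
print(cached.disk_reads)    # 1: only the first iteration touches "disk"
```

The more iterations a job runs, the larger the gap between the two counters becomes, which is why iterative workloads benefit most from pinning data in memory.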

Developers can use Spark from several programming languages, including Java, Scala, and Python, so they can build and run parallel applications in a language they already know. Spark comes with a built-in set of over 80 high-level operators. It supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box. Users can also combine all of these capabilities seamlessly in a single workflow.
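
The idea of chaining high-level operators can be sketched with a word count. The example below is plain Python and calls no Spark API; the comments only point out which kind of operator each step is loosely comparable to, and the sample lines are made up for illustration.

```python
# Plain-Python sketch of a chained operator pipeline (not Spark's API).
from collections import Counter

# Made-up sample input, standing in for lines read from a large file.
lines = [
    "spark makes big data simple",
    "big data needs fast engines",
    "spark is fast",
]

# Step 1 (loosely comparable to a flat-map): split every line into words.
words = (word for line in lines for word in line.split())

# Step 2 (loosely comparable to a filter): drop very short words.
long_words = (word for word in words if len(word) > 2)

# Step 3 (loosely comparable to a reduce-by-key): count each word.
# The generators above do no work until this step consumes them.
counts = Counter(long_words)

print(counts.most_common(3))
```

Each step is a small, composable transformation, which is the style of programming the built-in operators encourage.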

Spark has seen use, both standalone and over Hadoop, in iterative machine learning algorithms, interactive data mining and data processing, stream processing, and sensor data processing. It is very easy to get started writing powerful Big Data applications with Spark. So go on, take your turn and become a master Spark developer with our all-inclusive live online course, Big Data with Apache Spark using Scala.




Content developer. Writer. Dreamer.