Nov 2, 2012

What is Greenplum HD?


Greenplum HD is enterprise-ready Apache Hadoop from EMC that lets users write distributed processing applications for large data sets across clusters of commodity servers using a simple programming model. The framework automatically parallelizes MapReduce jobs to handle data at scale, so developers do not have to write their own scalable, parallel algorithms.
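The MapReduce model can be sketched in plain Java, with no Hadoop cluster required: a map step emits (word, 1) pairs, the framework groups pairs by key (the "shuffle"), and a reduce step sums each group. The class and method names below are illustrative only, not part of any Greenplum HD or Hadoop API.

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in an input line.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1));
    }

    // Reduce phase: sum all counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");

        // Shuffle: group mapped pairs by key, as the framework would do
        // across the cluster between the map and reduce phases.
        Map<String, List<Integer>> grouped = input.stream()
                .flatMap(WordCountSketch::map)
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        System.out.println(result);
        // Prints: {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In a real Hadoop job the map and reduce functions run on different machines and the shuffle moves data over the network; the point of the framework is that the developer writes only these two functions.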

Greenplum HD is based on the open source Apache Hadoop stack and includes the following components:

  • Hadoop Distributed File System (HDFS): File system that distributes files across the cluster. 
  • MapReduce: Framework for writing scalable data applications. 
  • Pig: Procedural language that abstracts away lower-level MapReduce code. 
  • Hive: Data warehouse infrastructure built on top of Hadoop. 
  • HBase: Database for random, real-time read/write access. 
  • Mahout: Scalable machine learning and data mining library. 
  • ZooKeeper: Centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. 
  • Spring Hadoop: Integrates with the Spring Framework to create and run Hadoop MapReduce, Hive, and Pig jobs as well as work with HDFS and HBase.
  • HVE (Hadoop Virtualization Extension): Makes Hadoop components virtualization-aware, improving deployment topology, scheduling, and resource utilization.