In 2014 we built a commercial Apache Hadoop Distribution for one of the Fortune 500 companies. The goal was to create an equivalent of Cloudera CDH with several added-value features: drivers, profilers, compression codecs, etc.
The distribution had the following key components:
Hadoop services deployment and provisioning to a cluster up to 2000 nodes
System health monitoring:
Collection, aggregation and storage of >200 various kinds of metrics both Hadoop and system resources from various sources over a cluster
Web console with enhanced metric data visualization, including interactive charts with pan/zoom/scale for live & historical data and heat map reflecting physical structure of the cluster (nodes, racks, rooms etc.)
Configurable alerts via e-mail and SNMP traps
Performance profiler for workflow-driven application hotspot breakdown:
collection of execution metrics for core Hadoop task phases (map, spill, shuffle, sort, reduce)
visual profiling charts to pinpoint performance problems and fine tune Hadoop configuration settings such as inefficient task scheduling config, inappropriate/odd compression, as well as system/hardware problems
visual browser and editor for unlimited size tables stored in HBase
table schema management rendering multiple data types supported by HBase
Amazon EMR integration
Java, Scala, Spring, Akka, AngularJS, Hadoop
Production clusters have been deployed for multiple clients of our customer, including AT&T, Cisco, HP, and China Mobile Communications Corporation.