What is Hadoop?
Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment.
Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It consists of three main functions: storage, processing and resource management.
Core services on Hadoop
MapReduce: MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of machines in a reliable and fault-tolerant manner.
HDFS: The Hadoop Distributed File System is a Java-based file system that provides scalable and reliable data storage across large clusters.
Hadoop YARN: YARN is a next-generation framework for Hadoop data processing that extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
Apache Tez: Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
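The MapReduce entry above describes a programming model more than a product, so a small illustration may help. The following is a minimal single-process sketch of the map, shuffle and reduce phases applied to a word count. It does not use the Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this example. On a real cluster the framework would run the map and reduce calls in parallel on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Group all emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence (illustration only, no parallelism)."""
    mapped = []
    for document in documents:
        # On a cluster, each input split would be mapped on a separate node.
        mapped.extend(map_phase(document))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data on Hadoop", "Hadoop stores big data"]))
```

Because each map call looks at only one document and each reduce call looks at only one key, the work can be split across machines without the pieces needing to coordinate, which is what makes the model suit large clusters.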
Hadoop Data Services
Apache Pig: A platform for processing and analyzing large data sets.
Apache HBase: A column-oriented NoSQL data storage system that provides random, real-time read/write access to big data for user applications.
Apache Hive: Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad hoc queries via a SQL-like interface for large datasets stored in HDFS.
Apache Flume: Flume efficiently aggregates and moves large amounts of log data from many different sources into Hadoop.
Apache Mahout: Mahout provides scalable machine learning algorithms for Hadoop, supporting data science tasks such as clustering, classification and batch-based collaborative filtering.
Apache Accumulo: Accumulo is a high-performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Bigtable design that works on top of Apache Hadoop and Apache ZooKeeper.
Apache Storm: Storm is a distributed real-time computation system for processing fast, large streams of data, adding reliable real-time data processing capabilities to Apache Hadoop 2.x.
Apache HCatalog: A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Apache Sqoop: Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable parallel load for various popular enterprise data sources.
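The cell-level access control mentioned in the Accumulo entry can be pictured with a small sketch. This is not the Accumulo client API. It is a simplified model in which every cell carries a set of required labels and a reader sees a cell only when holding all of them. The Cell class, the scan function and the sample labels are all hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required_labels: frozenset  # labels a reader must hold to see this cell


def scan(cells, reader_labels):
    """Return only the cells whose required labels are all held by the reader."""
    held = frozenset(reader_labels)
    return [cell for cell in cells if cell.required_labels <= held]


# Hypothetical sample data: one row with three cells of differing sensitivity.
cells = [
    Cell("patient1", "name", "Ana", frozenset()),
    Cell("patient1", "diagnosis", "flu", frozenset({"medical"})),
    Cell("patient1", "billing", "120.00", frozenset({"finance"})),
]

if __name__ == "__main__":
    for cell in scan(cells, {"medical"}):
        print(cell.column, cell.value)
```

The point of the sketch is that filtering happens per cell rather than per table or per row, so two readers scanning the same row can receive different subsets of its columns.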
Hadoop Operational Services
Apache ZooKeeper: A highly available system for coordinating distributed processes.
Apache Falcon: Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop.
Apache Ambari: An open source installation, lifecycle management, administration and monitoring system for Apache Hadoop clusters.
Apache Knox: The Knox gateway is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster.
Apache Oozie: Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
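The idea of combining several jobs sequentially into one logical unit of work, as described for Oozie above, can be sketched in a few lines. This is not Oozie's workflow format or API. The run_workflow function, the status strings and the stop-at-first-failure behavior are assumptions made for this illustration.

```python
def run_workflow(jobs):
    """Run (name, callable) jobs in order and report the outcome as one unit.

    Assumption for this sketch: the workflow stops at the first job that
    raises an exception, and later jobs are not started.
    """
    completed = []
    for name, job in jobs:
        try:
            job()
        except Exception as error:
            return {
                "status": "FAILED",
                "completed": completed,
                "failed": name,
                "error": str(error),
            }
        completed.append(name)
    return {"status": "SUCCEEDED", "completed": completed, "failed": None, "error": None}


if __name__ == "__main__":
    # Hypothetical job names; each callable stands in for a real Hadoop job.
    steps = [
        ("ingest", lambda: print("ingesting")),
        ("transform", lambda: print("transforming")),
        ("export", lambda: print("exporting")),
    ]
    print(run_workflow(steps)["status"])
```

Treating the chain as a single unit means a caller checks one status instead of tracking each job separately.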
What Hadoop can and can’t do
What Hadoop can’t do
You can’t use Hadoop for
- Structured data
- Transactional data
You can use Hadoop for
- Big Data