DW: Introduction to Apache Hive
1. Introduction to Hive
Apache Hive is an open-source data warehousing system built on top of Hadoop. It maps structured and semi-structured data files stored in Hadoop (HDFS) onto database tables and provides a SQL-like query language, Hive Query Language (HQL, also written HiveQL), for accessing and analyzing the large datasets stored there.
At its core, Hive translates HQL statements into MapReduce programs, which are then submitted to the Hadoop cluster for execution. Facebook originally developed Hive and later open-sourced it.
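To make the HQL-to-MapReduce translation more concrete, the following Python sketch imitates the map, shuffle, and reduce phases that a query such as `SELECT country, COUNT(*) FROM page_views GROUP BY country` could conceptually be broken into. The table `page_views`, its columns, the sample rows, and all function names are hypothetical and chosen only for illustration; this is a simplified in-memory model, not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(user_id, country).
rows = [
    ("u1", "US"),
    ("u2", "DE"),
    ("u3", "US"),
    ("u4", "FR"),
    ("u5", "DE"),
    ("u6", "US"),
]


def map_phase(rows):
    """Emit (group key, 1) for every row, like the map side of a GROUP BY count."""
    for _user_id, country in rows:
        yield country, 1


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, which yields COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_country(rows):
    """Chain the three phases for the hypothetical GROUP BY query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_country(rows))
```

Even this small aggregation needs three hand-written phases, whereas the HQL form is a single statement.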
Why use Hive?
- Challenges with direct processing of data using Hadoop MapReduce
  - High learning curve requiring proficiency in Java
  - Complexity in implementing complex query logic using MapReduce
- Benefits of using Hive for data processing
  - Offers a SQL-like interface, enabling rapid development (simple and user-friendly)
  - Reduces the learning curve by avoiding direct MapReduce coding
  - Supports custom functions for easy functionality extension
  - Leverages Hadoop’s strength in storing and analyzing massive datasets
2. Hive Architecture and Components
User Interfaces
These include the CLI, JDBC/ODBC, and a web GUI. The CLI (command-line interface) runs as a shell command line. Hive’s Thrift server lets external clients interact with Hive over the network, and it is the service that JDBC and ODBC clients connect through. The web GUI provides access to Hive from a browser.
Metadata Storage
Metadata is usually stored in a relational database such as MySQL or Derby. It comprises table names, columns, partitions and their attributes, table properties (e.g., whether a table is external), and the directory where each table’s data resides.
Driver
The driver comprises a syntax parser, plan compiler, optimizer, and executor. It takes an HQL query through lexical and syntax analysis, compilation, and optimization to produce a query plan. The generated plan is stored in HDFS and subsequently run by the execution engine.
Execution Engine
Hive doesn’t directly process data files but employs an execution engine. Currently, Hive supports three execution engines: MapReduce, Tez, and Spark.
3. Data Model
A data model is the framework used to describe, organize, and manipulate data; it captures the characteristics of real-world data.
Hive’s data model resembles the table structures of an RDBMS but adds concepts of its own. Data in Hive is organized at three levels of granularity:
- Tables: Basic organizational units storing data and defining its schema.
- Partitions: Divide a table’s data by the values of its partition columns, so that queries filtering on those columns can skip irrelevant data.
- Buckets: Split the data of a table or partition into a fixed number of parts by hashing a column value, aiding efficient querying.
Together, these three levels offer flexibility in how data is organized and accessed.
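The effect of partitions and buckets can be sketched in a few lines of Python. The sketch below assumes the common `key=value` directory layout for partitions and a simple hash-modulo rule for buckets in which an integer key hashes to itself. The table directory `/warehouse/sales`, the column names, and the function names are hypothetical, and the hash Hive really applies depends on the column type and Hive version, so treat this as an illustration of the idea rather than Hive’s exact behavior.

```python
def partition_path(table_dir, partition_spec):
    """Build a partition directory path of the form <table_dir>/key=value/..."""
    parts = [f"{key}={value}" for key, value in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)


def prune_partitions(paths, key, value):
    """Keep only the partition directories matching key=value.

    This mimics how a filter on a partition column lets a query skip
    every other directory instead of scanning the whole table.
    """
    wanted = f"{key}={value}"
    return [path for path in paths if wanted in path.split("/")]


def bucket_for(int_key, num_buckets):
    """Assign an integer key to a bucket by hash-modulo.

    Assumption: the hash of an integer is the integer itself, masked to a
    non-negative 31-bit value before taking the modulo.
    """
    return (int_key & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    table_dir = "/warehouse/sales"  # hypothetical table location
    days = ["2024-01-01", "2024-01-02", "2024-01-03"]
    paths = [partition_path(table_dir, [("dt", day)]) for day in days]

    print(prune_partitions(paths, "dt", "2024-01-02"))
    print({key: bucket_for(key, 4) for key in [1, 2, 5, 8, 13]})
```

Because rows with the same key always land in the same bucket, a fixed bucket count gives a predictable way to split the data of a table or partition into evenly sized parts.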