ACID Transactions in Hive: A Complete Guide
Diagram explaining ACID transactions in Hive
Introduction
Apache Hive has been widely used for batch processing and analytical queries on large datasets stored in Hadoop. However, traditional Hive lacked support for row-level operations such as INSERT, UPDATE, and DELETE, which are critical for transactional workloads. To overcome this limitation, Hive introduced ACID (Atomicity, Consistency, Isolation, Durability) transactions, enabling it to handle both analytical and transactional use cases effectively.
With ACID support, Hive ensures:
Atomicity → Transactions are processed completely or not at all.
Consistency → Data transitions from one valid state to another.
Isolation → Multiple transactions execute independently without conflicts.
Durability → Once committed, data remains safe even after failures.
In this tutorial, we will explore ACID transactions in Hive, their configuration, usage examples, and best practices.
ACID transactions allow Hive tables to support row-level operations, making Hive suitable for OLTP (Online Transactional Processing) alongside OLAP (Online Analytical Processing).
This means you can now:
Insert new records
Update existing records
Delete unwanted records
Merge datasets efficiently
ACID support is available only for transactional tables in Hive.
By default, ACID transactions are disabled in Hive. To enable them, update the following configuration properties in hive-site.xml
or via Hive CLI:
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
hive.support.concurrency
→ Enables concurrent transactions.
hive.txn.manager
→ Ensures Hive uses the transactional manager.
hive.compactor.*
→ Manages compaction for transactional tables.
To use ACID transactions, tables must be created as transactional:
CREATE TABLE employees (
emp_id INT,
emp_name STRING,
emp_salary DOUBLE
) CLUSTERED BY (emp_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
Requirements:
Tables must use ORC format.
Tables must be bucketed.
'transactional' = 'true'
property must be set.
INSERT INTO employees VALUES (101, 'Alice', 75000);
UPDATE employees SET emp_salary = 80000 WHERE emp_id = 101;
DELETE FROM employees WHERE emp_id = 101;
MERGE INTO employees e
USING new_data n
ON e.emp_id = n.emp_id
WHEN MATCHED THEN UPDATE SET e.emp_salary = n.emp_salary
WHEN NOT MATCHED THEN INSERT VALUES (n.emp_id, n.emp_name, n.emp_salary);
ACID operations generate multiple delta files, which can affect performance. Hive uses compaction to merge small files into bigger ones:
Minor Compaction → Merges delta files.
Major Compaction → Merges base and delta files.
Run compaction manually if needed:
ALTER TABLE employees COMPACT 'major';
Always use ORC format with bucketing for transactional tables.
Schedule regular compactions to improve query performance.
Avoid too many small transactions → batch inserts are better.
Use MERGE for efficient upserts instead of multiple updates.
Enable vectorized execution for faster performance with ORC.
ACID transactions in Hive bring strong transactional capabilities to big data environments. By enabling insert, update, delete, and merge operations, Hive can now handle both analytical (OLAP) and transactional (OLTP) workloads efficiently. Proper configuration and best practices ensure high performance and data reliability in large-scale production environments.