Step-by-Step Guide: Handling Avro File Format in Hive

9/11/2025


Handling Avro file format in Hive involves using AvroSerDe (Serializer/Deserializer) to allow Hive to read and write data in Avro format. This enables you to leverage Avro's schema evolution capabilities and efficient binary serialization within your Hive environment.

Step 1: What is Avro Format in Hive?

  • Avro is a row-based storage format whose schemas are defined in JSON; the data itself is serialized in a compact binary form.

  • It supports schema evolution, making it ideal when the data structure changes over time.

Benefits:

  • Compact binary format → saves storage space

  • Self-describing (schema embedded)

  • Supports schema evolution

  • Interoperable with many big data tools (Hive, Pig, Spark)


Step 2: Creating an Avro Table Using Avro Schema

You can create a Hive table using an external Avro schema file.

CREATE TABLE employees_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.url'='hdfs:///user/hive/schemas/employees.avsc'
);
  • employees.avsc is an Avro schema file stored on HDFS.

Sample Avro Schema (employees.avsc):

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "department", "type": "string"},
    {"name": "salary", "type": "double"}
  ]
}
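
To confirm that Hive picked up the schema from employees.avsc, describe the table: DESCRIBE lists the columns the AvroSerDe derived, and DESCRIBE FORMATTED also shows the SerDe, storage formats, and table properties such as avro.schema.url.

-- Columns derived from the Avro schema
DESCRIBE employees_avro;

-- SerDe, input/output formats, and TBLPROPERTIES (including avro.schema.url)
DESCRIBE FORMATTED employees_avro;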

Step 3: Creating an Avro Table Without External Schema

Alternatively, you can declare the columns directly in the CREATE TABLE statement and let Hive derive the Avro schema (supported via STORED AS AVRO in Hive 0.14 and later).

CREATE TABLE employees_avro_inline (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
STORED AS AVRO;
  • Hive internally creates an Avro schema for the table.
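
If you want the schema embedded in the table definition itself rather than in a separate file, the AvroSerDe also accepts an avro.schema.literal table property; a minimal sketch using the same Employee record:

-- Sketch: embed the Avro schema directly in the DDL via avro.schema.literal
-- instead of pointing at a file with avro.schema.url.
CREATE TABLE employees_avro_literal
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "type": "record",
    "name": "Employee",
    "fields": [
      {"name": "id", "type": "int"},
      {"name": "name", "type": "string"},
      {"name": "department", "type": "string"},
      {"name": "salary", "type": "double"}
    ]
  }'
);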


Step 4: Loading Data into Avro Table

You can load existing Avro data files from HDFS. LOAD DATA only moves the files into the table's directory, so they must already be valid Avro files whose schema matches the table's schema.

LOAD DATA INPATH '/user/hive/input/employees.avro'
INTO TABLE employees_avro;
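
For quick tests you can also write a few rows directly. This is a minimal sketch assuming Hive 0.14 or later (where INSERT ... VALUES is available); the sample values are only for illustration.

-- Hypothetical test rows; the AvroSerDe serializes them into Avro data files.
INSERT INTO TABLE employees_avro VALUES
  (1, 'Asha', 'Engineering', 95000.0),
  (2, 'Ravi', 'Finance', 72000.0);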

Step 5: Converting Text Data to Avro Format (CTAS)

Use CREATE TABLE AS SELECT (CTAS) to convert existing text data to Avro. A new table name is used here so it does not clash with the employees_avro table created in Step 2.

CREATE TABLE employees_avro_ctas
STORED AS AVRO
AS
SELECT * FROM employees_text;
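
If the target Avro table already exists (for example, the employees_avro table from Step 2), a sketch of the alternative is INSERT OVERWRITE; this assumes the columns of employees_text line up with the table's Avro schema.

-- Overwrite the existing Avro table with the converted rows.
INSERT OVERWRITE TABLE employees_avro
SELECT id, name, department, salary
FROM employees_text;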

Step 6: Querying Avro Tables

You can query Avro tables like any other Hive table.

SELECT name, department FROM employees_avro WHERE salary > 50000;
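
Standard HiveQL aggregations and joins work unchanged on Avro tables; a simple per-department summary using the columns from the schema above might look like this:

-- Per-department headcount and average salary on the Avro-backed table.
SELECT department,
       COUNT(*)    AS headcount,
       AVG(salary) AS avg_salary
FROM employees_avro
GROUP BY department;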

Step 7: Evolving Avro Schemas in Hive

If your schema changes (like adding new columns):

  1. Update your .avsc schema file.

  2. Update the table property:

ALTER TABLE employees_avro
SET TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/employees_v2.avsc');
  • Avro resolves added fields only when they declare default values, so records written with the old schema can still be read; see the example schema below.
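
As a sketch, a hypothetical employees_v2.avsc could add an optional email field with a null default, so records written under the original schema still resolve:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "department", "type": "string"},
    {"name": "salary", "type": "double"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}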


✅ Best Practices

  • Use Avro when schema evolution is expected.

  • Keep .avsc schemas under version control (for example, Git) and publish the active copy to a stable HDFS path referenced by avro.schema.url.

  • Always validate the .avsc schema (valid JSON and a valid Avro record definition) before pointing a table at it.

  • Combine Avro with partitioning for better performance.

  • For analytics-heavy workloads, convert Avro to ORC/Parquet after ingestion.
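
The last two practices can be sketched as follows; the table and column names are assumed for illustration and mirror the examples above.

-- Hypothetical Avro table partitioned by department, so queries that filter
-- on department only scan the matching partitions.
CREATE TABLE employees_avro_part (
  id INT,
  name STRING,
  salary DOUBLE
)
PARTITIONED BY (department STRING)
STORED AS AVRO;

-- Populate it from the text table with dynamic partitioning enabled.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE employees_avro_part PARTITION (department)
SELECT id, name, salary, department FROM employees_text;

-- After ingestion, convert to ORC for analytics-heavy workloads.
CREATE TABLE employees_orc
STORED AS ORC
AS
SELECT * FROM employees_avro;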


This guide helps you handle Avro files in Hive efficiently, with schema evolution support and smooth integration.