What is the difference between SORT BY and ORDER BY in Hive?
ORDER BY vs SORT BY in Hive – Examples and Performance Differences
Introduction
In Apache Hive, sorting data efficiently is crucial for big data processing. Hive provides two main clauses for sorting: ORDER BY and SORT BY. While they might seem similar, they behave differently under the hood and significantly impact query performance.
This tutorial will explain ORDER BY vs SORT BY in Hive, with step-by-step examples, code snippets, and performance insights.
Before exploring the differences, let's create a sample table named DeveloperIndian
to demonstrate:
create table DeveloperIndian(
Id int,
Name string,
Salary float,
Department string
)
row format delimited
fields terminated by ',';
Load sample developer data into the table:
load data local inpath '/home/codegyani/hive/developerIndian_data'
into table DeveloperIndian;
The ORDER BY clause performs a total ordering of the query result set. This means:
All the data is passed through a single reducer.
Ensures a globally sorted output.
Can be slow for large datasets because of the single reducer bottleneck.
select *
from DeveloperIndian
order by Salary desc;
Output:
The entire dataset is sorted by salary in descending order, producing a globally sorted result.
The SORT BY clause provides a local ordering:
Data is sorted within each reducer.
Multiple reducers can work in parallel.
Does not guarantee global sorting.
Better performance for large datasets.
select *
from DeveloperIndian
sort by Salary desc;
Output:
Each reducer outputs sorted data, but the overall result set is not globally sorted.
Feature | ORDER BY | SORT BY |
---|---|---|
Output Sorting | Globally sorted | Locally sorted within reducers |
Reducers Used | Single reducer | Multiple reducers |
Performance | Slower on large datasets | Faster on large datasets |
Use Case | When a complete sorted result is required | When partial/local sorting is fine |
Use ORDER BY only for small datasets where global sorting is necessary.
For large-scale data processing, prefer SORT BY to achieve faster execution without reducer bottlenecks.
Both ORDER BY and SORT BY are essential Hive clauses for data sorting.
Use ORDER BY when you need the entire dataset sorted globally.
Use SORT BY for large datasets when local sorting suffices, improving performance.
By choosing the right clause, you can optimize query speed and handle big data efficiently in Apache Hive.