What is the difference between SORT BY and ORDER BY in Hive?

10/26/2022

ORDER BY vs SORT BY in Hive – Examples and Performance Differences

Go Back

ORDER BY and SORT BY Clause in Hive: Complete Guide with Examples

Introduction

In Apache Hive, sorting data efficiently is crucial for big data processing. Hive provides two main clauses for sorting: ORDER BY and SORT BY. While they might seem similar, they behave differently under the hood and significantly impact query performance.

This tutorial will explain ORDER BY vs SORT BY in Hive, with step-by-step examples, code snippets, and performance insights.


ORDER BY vs SORT BY in Hive – Examples and Performance Differences

Creating a Table in Hive

Before exploring the differences, let's create a sample table named DeveloperIndian to demonstrate:

create table DeveloperIndian(
    Id int,
    Name string,
    Salary float,
    Department string
)    
row format delimited    
fields terminated by ',';

Loading Data into the Table

Load sample developer data into the table:

load data local inpath '/home/codegyani/hive/developerIndian_data'
into table DeveloperIndian;

Understanding ORDER BY in Hive

The ORDER BY clause performs a total ordering of the query result set. This means:

  • All the data is passed through a single reducer.

  • Ensures a globally sorted output.

  • Can be slow for large datasets because of the single reducer bottleneck.

Example:

select *
from DeveloperIndian
order by Salary desc;

Output:
The entire dataset is sorted by salary in descending order, producing a globally sorted result.


Understanding SORT BY in Hive

The SORT BY clause provides a local ordering:

  • Data is sorted within each reducer.

  • Multiple reducers can work in parallel.

  • Does not guarantee global sorting.

  • Better performance for large datasets.

Example:

select *
from DeveloperIndian
sort by Salary desc;

Output:
Each reducer outputs sorted data, but the overall result set is not globally sorted.


Key Differences Between ORDER BY and SORT BY

FeatureORDER BYSORT BY
Output SortingGlobally sortedLocally sorted within reducers
Reducers UsedSingle reducerMultiple reducers
PerformanceSlower on large datasetsFaster on large datasets
Use CaseWhen a complete sorted result is requiredWhen partial/local sorting is fine

Performance Tip

  • Use ORDER BY only for small datasets where global sorting is necessary.

  • For large-scale data processing, prefer SORT BY to achieve faster execution without reducer bottlenecks.


Conclusion

Both ORDER BY and SORT BY are essential Hive clauses for data sorting.

  • Use ORDER BY when you need the entire dataset sorted globally.

  • Use SORT BY for large datasets when local sorting suffices, improving performance.

 

By choosing the right clause, you can optimize query speed and handle big data efficiently in Apache Hive.

 

Table of content