Top Apache Hive Interview Questions
admin
1. What is Hive, and how does it work? Hive is a data warehouse system built on top of Hadoop for querying and managing large datasets. It uses HiveQL, a SQL-like query language, to query data stored in HDFS, and compiles each query into jobs for an execution engine such as MapReduce or Tez.
2. What are the key features of Hive?
3. What is HiveQL? HiveQL is a query language similar to SQL used to query and analyze data stored in Hive tables.
4. What is the difference between Hive and HBase? Hive is a query layer used for batch processing and analytics, while HBase is a NoSQL store built on HDFS that is used for low-latency, real-time reads and writes of individual rows.
5. What are Hive's data storage limitations?
6. What is a Hive Metastore? Metastore stores metadata about Hive tables, partitions, and other data structures.
7. Explain Hive partitions. Partitions in Hive allow the division of a table into logical subparts based on column values, improving query performance.
8. What is bucketing in Hive? Bucketing distributes rows into a fixed number of buckets based on the hash of a column, which makes sampling and joins on that column more efficient.
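As a rough illustration of bucketing, the sketch below declares a bucketed table. The table name, columns, and bucket count are hypothetical and chosen only for the example.

```sql
-- Hypothetical table: rows are assigned to one of 32 buckets
-- according to the hash of user_id.
CREATE TABLE user_events (
  user_id    BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```

Two tables bucketed on the same join key, with compatible bucket counts, are candidates for a bucketed map join.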
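9. What is the difference between internal and external tables in Hive? For internal (managed) tables, Hive manages both the metadata and the data, so dropping the table also deletes the data. For external tables, Hive manages only the metadata and the data stays outside Hive's control, so dropping the table leaves the underlying files in place.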
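A minimal sketch of an external table, assuming a hypothetical tab-delimited dataset already sitting under the HDFS path /data/raw/web_logs (both the path and the columns are made up for illustration):

```sql
-- Hive records only the metadata; the files stay where they are.
CREATE EXTERNAL TABLE web_logs (
  ip     STRING,
  url    STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/web_logs';

-- Dropping an external table removes the metadata only;
-- the files under LOCATION are left untouched.
DROP TABLE web_logs;
```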
10. How does Hive handle schema evolution? Hive allows schema changes such as adding columns without affecting existing data.
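For example, adding a column is a metadata-only change. The sketch below assumes a hypothetical existing table named web_logs; the new column name is also illustrative.

```sql
-- Append a column to the table definition; existing data files are not rewritten.
ALTER TABLE web_logs ADD COLUMNS (user_agent STRING COMMENT 'added later');

-- Rows written before the change typically return NULL for the new column.
SELECT ip, user_agent FROM web_logs LIMIT 10;
```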
11. What are the different file formats supported by Hive?
12. How does Hive optimize query execution? Hive can run queries on faster execution engines such as Tez instead of MapReduce, and it applies cost-based optimization and other plan-level optimizations to improve query execution.
13. What is the role of a SerDe in Hive? SerDe (Serializer/Deserializer) is responsible for reading and writing data in Hive tables.
14. What is a UDF in Hive? A User-Defined Function allows custom logic to process data during query execution.
15. Can Hive handle streaming data? Hive is not designed for streaming data. Tools like Kafka and Spark Streaming are better suited for such tasks.
16. How can you optimize Hive queries?
17. What is vectorization in Hive? Vectorization processes rows in batches, improving query performance for large datasets.
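Vectorization is controlled by session or site settings. A minimal sketch, assuming a columnar table (vectorized execution has traditionally targeted ORC data); the table name sales is hypothetical:

```sql
-- Enable vectorized execution on the map side and, where supported, the reduce side.
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

-- Scans and aggregations like this one are typical candidates for vectorization.
SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date;
```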
18. What is dynamic partitioning in Hive? Dynamic partitioning allows creating partitions during query execution based on the data.
19. What is the difference between static and dynamic partitioning? Static partitions are predefined, while dynamic partitions are generated at runtime.
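The difference can be sketched with two inserts. The tables sales (partitioned by sale_date) and staging_sales are hypothetical names used only for this example.

```sql
-- Allow dynamic partitions; nonstrict mode permits all partition columns to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Static partitioning: the partition value is fixed in the statement.
INSERT INTO TABLE sales PARTITION (sale_date = '2024-01-01')
SELECT order_id, amount
FROM staging_sales
WHERE sale_date = '2024-01-01';

-- Dynamic partitioning: Hive derives the partition from the data.
-- The partition column must come last in the SELECT list.
INSERT INTO TABLE sales PARTITION (sale_date)
SELECT order_id, amount, sale_date
FROM staging_sales;
```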
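20. How to avoid small files in Hive? Use CombineHiveInputFormat so that a single mapper reads several small files, enable Hive's output merge settings (such as hive.merge.mapfiles and hive.merge.mapredfiles) so that jobs write fewer, larger files, or compact existing ORC data with ALTER TABLE ... CONCATENATE.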
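One common approach is to let Hive merge its own output files at the end of a job. The settings below are real Hive properties, but the size thresholds and the table name sales are illustrative assumptions, not recommendations.

```sql
-- Merge small output files at the end of map-only, MapReduce, and Tez jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.tezfiles=true;

-- Trigger a merge when the average output file is below ~128 MB,
-- aiming for merged files of roughly 256 MB (values are examples only).
SET hive.merge.smallfiles.avgsize=134217728;
SET hive.merge.size.per.task=268435456;

-- For an existing ORC table or partition, compact files in place.
ALTER TABLE sales PARTITION (sale_date = '2024-01-01') CONCATENATE;
```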
21. How do you debug a Hive query?
22. What is the purpose of the Hive CLI? The Hive CLI allows users to execute Hive queries interactively or in batch mode.
23. What are Hive logs, and where can you find them? Hive logs capture query execution details. The client-side log is written to a local directory (by default /tmp/<user>/hive.log), while task-level details appear in the Hadoop job logs.
24. Why might Hive queries run slowly?
25. How can you improve join performance in Hive? Use map-side joins or bucketed map joins for efficient data processing.
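A hedged sketch of the settings involved, assuming a large hypothetical fact table orders and a small hypothetical dimension table dim_region; the size threshold shown is an example value:

```sql
-- Let Hive convert a join into a map-side join when one side is small enough.
SET hive.auto.convert.join=true;
-- Size threshold (in bytes) under which a table counts as "small".
SET hive.mapjoin.smalltable.filesize=25000000;
-- Allow bucketed map joins when both tables are bucketed on the join key.
SET hive.optimize.bucketmapjoin=true;

-- The small table is loaded into memory on each mapper, avoiding a shuffle.
SELECT o.order_id, r.region_name
FROM orders o
JOIN dim_region r ON o.region_id = r.region_id;
```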
26. What are the main components of Hive architecture?
27. What is the function of the Hive driver? The Hive driver manages query execution and communicates between the user and the execution engine.
28. Explain the role of the execution engine in Hive. The execution engine processes the query and translates it into MapReduce or Tez jobs.
29. What is a semantic analyzer in Hive? The semantic analyzer validates query correctness and prepares the execution plan.
30. How does Hive integrate with Hadoop? Hive uses HDFS for storage and MapReduce or Tez for processing queries.
31. How do you load data into a Hive table? Data can be loaded using the LOAD DATA or INSERT INTO commands.
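Both commands can be sketched as follows. The file paths and the tables web_logs and web_logs_orc are hypothetical.

```sql
-- Move a file that is already in HDFS into the table's storage location.
LOAD DATA INPATH '/user/hive/staging/web_logs.tsv' INTO TABLE web_logs;

-- Copy a file from the local filesystem, replacing the table's current contents.
LOAD DATA LOCAL INPATH '/tmp/web_logs.tsv' OVERWRITE INTO TABLE web_logs;

-- Populate a table from the result of a query.
INSERT INTO TABLE web_logs_orc
SELECT ip, url, status FROM web_logs;
```

LOAD DATA does not transform the file, so its layout has to match the table's storage format; INSERT INTO ... SELECT is the usual way to convert between formats.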
32. How to write a simple HiveQL query?
SELECT * FROM table_name WHERE column_name = 'value';
33. What are complex data types in Hive?
34. How can you access subdirectories recursively in Hive? Set the properties:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
35. What is the difference between HiveQL and SQL? HiveQL is a SQL-like dialect designed for batch analytics over large datasets in Hadoop, while SQL on a traditional relational database is more suited to low-latency transactional workloads.
36. Can Hive handle unstructured data? Hive is better suited for structured and semi-structured data.
37. What are ACID transactions in Hive? Hive supports ACID transactions, including row-level updates and deletes, on tables stored in ORC format and declared as transactional.
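A minimal sketch of a transactional table, assuming the cluster has the transaction manager enabled. The table and values are hypothetical, and older Hive versions additionally require the table to be bucketed.

```sql
-- Session settings usually required for ACID operations.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Transactional tables must be stored as ORC.
CREATE TABLE accounts (
  id      BIGINT,
  owner   STRING,
  balance DECIMAL(12,2)
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level changes are then allowed.
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
DELETE FROM accounts WHERE owner = 'closed';
```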
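38. How does Hive handle NULL values? NULL represents a missing or unknown value, and the keyword is case-insensitive, so NULL and null are equivalent. Comparing a value with NULL using = yields NULL rather than true, so use IS NULL or IS NOT NULL to test for it.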
39. What is a lateral view in Hive? A lateral view allows splitting complex data types like arrays into rows for easier querying.
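For instance, assuming a hypothetical table orders_with_items that has an ARRAY<STRING> column named items, a lateral view can turn each array element into its own row:

```sql
-- explode() emits one row per array element; LATERAL VIEW joins
-- those rows back to the originating row of the base table.
SELECT o.order_id, item
FROM orders_with_items o
LATERAL VIEW explode(o.items) exploded AS item;
```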
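40. How can you use Hive with Spark? Spark can read and write Hive tables through the Hive Metastore for faster query execution. Older Spark versions use the HiveContext class for this; since Spark 2.0 it is deprecated in favor of a SparkSession created with enableHiveSupport().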
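41. What is the difference between Hive and Spark SQL? Hive traditionally runs queries as MapReduce jobs (or on Tez), which write intermediate results to disk and are generally slower, while Spark SQL uses in-memory processing and is generally faster for iterative and interactive workloads.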
42. What is the role of ORC in Hive? ORC (Optimized Row Columnar) improves query performance and compression.
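43. How does Hive handle schema-on-read? Hive does not validate data against the table schema when the data is loaded. Instead, it reads the schema metadata from the Metastore and applies it to the underlying files at query time.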
44. What is HCatalog in Hive? HCatalog is a data storage management layer that allows seamless data access between tools like Pig and MapReduce.
45. What is Tez in Hive? Tez is a faster execution engine used in Hive for DAG-based processing.
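Switching engines is a session-level setting, assuming Tez is installed on the cluster. The query and the table name sales are only illustrative.

```sql
-- Run subsequent queries in this session on Tez instead of MapReduce.
SET hive.execution.engine=tez;

SELECT sale_date, COUNT(*) FROM sales GROUP BY sale_date;
```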
46. How do you perform data partitioning in Hive? Partition data using the PARTITIONED BY clause while creating the table.
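A short sketch of a partitioned table, with hypothetical table and column names:

```sql
-- The partition column is declared separately from the regular columns;
-- each distinct sale_date value gets its own directory.
CREATE TABLE sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

-- A filter on the partition column lets Hive read only the matching partition.
SELECT SUM(amount) FROM sales WHERE sale_date = '2024-01-01';

-- List the partitions that currently exist.
SHOW PARTITIONS sales;
```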
47. Can Hive work without Hadoop? No, Hive depends on Hadoop for storage and processing.
48. How can you use custom SerDe in Hive? Specify the SerDe class while creating a table using the ROW FORMAT SERDE clause.
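The sketch below uses OpenCSVSerde, a SerDe that ships with Hive, as a stand-in. A custom SerDe would be referenced the same way by its fully qualified class name, after adding its JAR to the session with ADD JAR. The table name and columns are hypothetical.

```sql
-- OpenCSVSerde reads every column as STRING.
CREATE TABLE csv_import (
  id   STRING,
  name STRING,
  city STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE;
```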
49. What is the use of EXPLAIN in Hive? EXPLAIN provides the execution plan of a Hive query.
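For example, prefixing a query with EXPLAIN prints its plan instead of running it. The table sales is a hypothetical name.

```sql
-- Show the stages and operators Hive would use for this query.
EXPLAIN
SELECT sale_date, SUM(amount)
FROM sales
GROUP BY sale_date;

-- EXTENDED adds further detail to the plan.
EXPLAIN EXTENDED
SELECT sale_date, SUM(amount)
FROM sales
GROUP BY sale_date;
```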
50. What are windowing functions in Hive? Windowing functions perform calculations across a set of rows related to the current row, like ranking or aggregation.
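A small sketch of ranking and aggregation over a window, assuming a hypothetical sales table with order_id, sale_date, and amount columns:

```sql
SELECT
  order_id,
  sale_date,
  amount,
  -- Rank each order within its day, largest amount first.
  RANK() OVER (PARTITION BY sale_date ORDER BY amount DESC) AS rank_in_day,
  -- Attach the daily total to every row without collapsing the rows.
  SUM(amount) OVER (PARTITION BY sale_date) AS day_total
FROM sales;
```

Unlike GROUP BY, the window functions keep one output row per input row.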