
Crack Data Engineering Interviews

Apache Hive
120+ Real-time Interview Questions
THE BRIGHT SIDE | @ashok.thebrightside | ashok.thebrightside@gmail.com

1. What is Apache Hive?

• Open-source data warehouse software for reading, writing, and managing large datasets stored in HDFS or HBase.
• It provides an SQL-like interface (HiveQL) to query the data.
• Comparable SQL-on-Hadoop query engines: Impala, Trino, Presto.

120+ Hive Interview Questions

2. What is the default database Hive uses to store its metadata?

• Derby.
• It is the embedded metastore of Hive.
• The metastore can be configured to use any relational database (e.g., MySQL, PostgreSQL).

3. What is the default execution engine in Hive?

• MapReduce.

4. How do you change the execution engine in Hive?

• SET hive.execution.engine = tez;

5. Which component of Hive connects to the Hadoop cluster for query execution?

• The execution engine.

6. Which component of Hive converts a SQL query into a jar file for execution?

• The execution engine.

7. How do you see the execution plan of a query?

• Use EXPLAIN.
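A minimal sketch (table and column names are hypothetical):

```sql
-- Show the query plan (stages and operators) without running the query
EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept;

-- EXTENDED adds more detail, such as file paths and SerDe information
EXPLAIN EXTENDED SELECT dept, COUNT(*) FROM employees GROUP BY dept;
```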

8. How do you see the database you are currently working in?

• SET hive.cli.print.current.db = true;
• beeline --showDbInPrompt=true

9. What is the default storage location of Hive tables?

• /user/hive/warehouse

10. How do you create a table that exists only for a particular Hive session?

• Create a TEMPORARY table.
• Hive 0.14 onward supports temporary tables. Within a user session they behave like normal tables. Their data is stored under /tmp/hive/<user>/*.
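A minimal sketch of a session-scoped table (table and column names are hypothetical):

```sql
-- Visible only in the current session
CREATE TEMPORARY TABLE tmp_txns (
  txn_id INT,
  amount DOUBLE
);

INSERT INTO tmp_txns VALUES (1, 99.5);
-- The table (and its data under /tmp/hive/<user>/) disappears when the session ends
```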

11. What are the limitations of Hive temporary tables?

• A temporary table is visible only in the session that created it.
• It is automatically removed when the session terminates.
• Partition columns are not supported on temporary tables.
• Creating indexes on temporary tables is not allowed.
• Within that session, the user cannot access a permanent table with the same name without first dropping or renaming the temporary table.

12. What are the different types of tables in Hive?

• Temporary table
• Managed (internal) table
• External table
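A minimal sketch of the two persistent variants (table names and the location path are hypothetical):

```sql
-- Managed table: Hive owns the data; DROP TABLE deletes data and metadata
CREATE TABLE managed_txns (id INT, amount DOUBLE);

-- External table: Hive owns only the metadata; DROP TABLE leaves the files in place
CREATE EXTERNAL TABLE external_txns (id INT, amount DOUBLE)
LOCATION '/data/landing/txns';
```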

13. Scenario: We need to store a feed into a Hive table and expose it for consumption. What steps will you take?

• Talk to the source team and get schema information for the feed: data types, record size, data volume, etc.
• Talk to the business team that will consume the data and find out the search criteria they will use.
• Based on these two inputs, create the table with appropriate partitioning, bucketing, indexing, etc.

14. What are the default terminators in Hive?

• Field terminator – Ctrl+A (\001)
• Line terminator – \n

15. How do you get the schema information of a table in Hive?

• DESCRIBE FORMATTED <table_name>;

16. Can we use DESCRIBE on a database to see its details?

• Yes. It can be used for tables, databases, and views.

17. How do you load data from a Linux location into a Hive table location using a Hive script?

• LOAD DATA LOCAL INPATH '/home/hduser/hive/data/txns' INTO TABLE txnrecords;
• LOAD DATA LOCAL INPATH '/home/hduser/hive/data/txns' OVERWRITE INTO TABLE txnrecords;

It copies the content from the Linux filesystem to the Hive table location.

18. How do you load data from an HDFS location into a Hive table location using a Hive script?

• LOAD DATA INPATH '/user/hduser/txns1' INTO TABLE txnrecords;
• LOAD DATA INPATH '/user/hduser/txns1' OVERWRITE INTO TABLE txnrecords;

It moves the data from the HDFS location to the Hive table location.

19. How do you convert rows of records into a single object in Hive?

• Use collect_set() or collect_list().
• collect_set() removes duplicates; collect_list() keeps them.
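A minimal sketch (the orders table and its columns are hypothetical):

```sql
SELECT customer_id,
       collect_list(product) AS all_products,      -- keeps duplicates
       collect_set(product)  AS distinct_products  -- removes duplicates
FROM orders
GROUP BY customer_id;
```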

20. How do you convert an array of data into individual rows?

• explode() – cannot be used alongside other columns in the SELECT list.
• LATERAL VIEW explode() – can be used with other columns.
• posexplode() – explode that also returns each element's index.

(Slides with screenshots omitted: posexplode example; converting columns to rows (UNPIVOT) in HiveQL.)
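A minimal sketch of both forms (the users table and its columns are hypothetical):

```sql
-- Hypothetical table: users(name STRING, hobbies ARRAY<STRING>)
SELECT name, hobby
FROM users
LATERAL VIEW explode(hobbies) h AS hobby;

-- posexplode additionally returns the element's position within the array
SELECT name, pos, hobby
FROM users
LATERAL VIEW posexplode(hobbies) h AS pos, hobby;
```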

21. What are the different types of indexes available in Hive?

• Compact index – use on high-cardinality columns.
• Bitmap index – use on low-cardinality columns.
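A sketch of the index DDL for older Hive versions (indexing was removed in Hive 3.0; table and index names are hypothetical):

```sql
CREATE INDEX idx_txn_id
ON TABLE txnrecords (txn_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate or refresh the index after data changes
ALTER INDEX idx_txn_id ON txnrecords REBUILD;
```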

22. What are the different types of partitions available in Hive?

• Static – the partition value is supplied explicitly, based on some external factor.
• Dynamic – the partition value is derived from the data.
• set hive.exec.dynamic.partition=true;
• set hive.exec.dynamic.partition.mode=nonstrict;

Static partition:
• load data local inpath '/home/hduser/hive/student_details2' into table student partition(course="hadoop");

Dynamic partition:
• load data local inpath '/home/hduser/hive/student_details' into table stud_demo;
• insert into student_part partition(course) select id, name, age, institute, course from stud_demo;

23. Where are NULL and empty partition values stored?

• In the __HIVE_DEFAULT_PARTITION__ directory.

24. Are partition values case sensitive?

• Yes.

25. Can we do dynamic partitioning using the LOAD DATA command?

• No.

26. Will the metadata be updated if we load partition data using the LOAD DATA command?

• Yes.
• load data local inpath '/home/hduser/hive/student_details2' into table student partition(course="hadoop");

27. Can we do UPDATE and DELETE on a Hive table?

• By default, no.
• Yes, if transactions are enabled and the table is bucketed and stored in ORC format.
• TBLPROPERTIES ('transactional'='true')
• set hive.enforce.bucketing = true;
• Limitations:
• External tables are not supported.
• Only the ORC format is supported.
• LOAD is not supported on ACID transactional tables; use INSERT INTO instead.
• In a transactional session all operations are auto-committed, as BEGIN, COMMIT, and ROLLBACK are not yet supported.
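A sketch of an ACID table, assuming the metastore is configured for transactions (table and column names are hypothetical):

```sql
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Bucketed ORC table with transactions enabled
CREATE TABLE acid_txns (id INT, amount DOUBLE)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO acid_txns VALUES (1, 10.0), (2, 20.0);
UPDATE acid_txns SET amount = 15.0 WHERE id = 1;
DELETE FROM acid_txns WHERE id = 2;
```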

28. How does partitioning help in managing data?

• We can easily remove a portion of the data by simply dropping its partition.
• We can update a portion of the data by overwriting that partition folder's contents.

29. Scenario: How do you load only a few fields from a file?

• It is a two-step process:
• Load all the data into a managed staging table using LOAD DATA LOCAL INPATH.
• Do an INSERT ... SELECT of the required fields from the managed table into the external target table.

30. Scenario: I manually created a partition folder and put data inside it, but when I query the Hive table for that partition value I get no data. What should I do?

• Run MSCK REPAIR TABLE, or manually add the partition.
• MSCK REPAIR is Hive's metastore consistency check.
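A minimal sketch of both options (the sales table and its dt partition column are hypothetical):

```sql
-- Sync manually created partition folders into the metastore
MSCK REPAIR TABLE sales;

-- Or register a single partition explicitly
ALTER TABLE sales ADD PARTITION (dt='2024-01-01');
```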

31. Scenario: I deleted some partition folders manually and then ran MSCK REPAIR. Will it remove those partitions from the metadata?

• No, we have to DROP the partitions manually.

32. Scenario: I created a new partition folder named after a non-partition column and ran MSCK REPAIR. Will it update the metadata?

• No, it will throw an error. (We can only add new values for the existing partition column.)

33. How do you choose the partition type?

• Dynamic partition – to create partitions based on the data.
• Static partition – to create partitions based on external parameters.

34. In what scenario would we use ADD PARTITION rather than MSCK REPAIR?

• With ADD PARTITION we can specify the location for the partition value.
• Example: ALTER TABLE emp ADD PARTITION (dept='ACCT') LOCATION '/user/hive/dept=ACCT';

35. Can we perform DELETE and TRUNCATE on a Hive table?

• TRUNCATE is possible (managed tables only).
• DELETE is not possible by default; it requires a transactional, bucketed ORC table.

36. What are the pros and cons of Hive partitioning?

Pros:
• Improves search performance.
• Manages data efficiently.
• Makes it easy to update or delete a portion of the data.

Cons:
• It adds overhead on the NameNode to maintain partition directory information. If partitioning is mistakenly done on a high-cardinality column, it puts even more burden on the NameNode.

37. How do you segregate or partition the data if the table has no low-cardinality column?

• Use bucketing.
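A minimal bucketing sketch on a high-cardinality column (table and column names are hypothetical):

```sql
CREATE TABLE txns_bucketed (txn_id INT, amount DOUBLE)
CLUSTERED BY (txn_id) INTO 8 BUCKETS;

-- Older Hive versions need this so inserts honor the declared bucket count
SET hive.enforce.bucketing = true;
INSERT INTO txns_bucketed SELECT txn_id, amount FROM txns_staging;
```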

38. How many reducers are launched when we insert data into a bucketed table?

• The number of reducers equals the number of buckets.

39. How does bucketing work?

• It segregates the data by applying a modulo operation between the hash of the bucket column's value and the number of buckets: bucket = hash(column) % num_buckets.

40. Which bucket do NULL and empty strings go to?

• They are stored in bucket zero.

41. Why is a bucket join faster than a partition join?

• With bucketing, each bucket holds a roughly equal amount of data, and only the matching buckets from each table are picked up and joined.

42. How do you export Hive table data to a Linux location?

• Use INSERT OVERWRITE LOCAL DIRECTORY:
• INSERT OVERWRITE LOCAL DIRECTORY '/home/hduser/data' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM temp;

43. Scenario: I have manually created several partition folders. What is the easiest way to add that partition information to the metastore?

• MSCK REPAIR TABLE.

44. Scenario: You have 10 columns in a Hive table; how do you export only columns 1, 3, and 5 with Sqoop?

• It is a two-step process:
• Create a managed table with the required columns and do an INSERT ... SELECT of those columns from the source table.
(OR)
INSERT OVERWRITE DIRECTORY '/user/hduser/data' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT col1, col3, col5 FROM temp;
• Then do a Sqoop export from that directory using --export-dir.

45. How do you run a Hive script from the Linux terminal?

• hive -f hiveFile.hql
• beeline --hivevar var1=value -f file.sql
• hive -e "select * from custdb.sample"
• beeline -e "select current_date()"

46. What is the default file format in Hive?

• TextFile.

47. What file formats are supported by Hive?

• TextFile
• SequenceFile
• RCFile
• Avro
• ORC
• Parquet

48. What is a SerDe in Hive?

• SerDe is short for Serializer/Deserializer.
• Hive uses SerDes to serialize data when storing it in HDFS and to deserialize it when retrieving it.
• The default is LazySimpleSerDe.
• Built-in SerDes:
• Avro (Hive 0.9.1 and later)
• ORC (Hive 0.11 and later)
• RegEx
• Thrift
• Parquet (Hive 0.13 and later)
• CSV (Hive 0.14 and later)
• JsonSerDe (Hive 0.12 and later, in hcatalog-core)
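A minimal sketch using the built-in JsonSerDe (table name, columns, and location are hypothetical):

```sql
-- Read newline-delimited JSON files as a table
CREATE EXTERNAL TABLE events_json (
  user_id STRING,
  action  STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/events';
```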

49. Scenario: How do you process XML data in Hive?

• Use an XML SerDe library.
• Specify the fields to be parsed using XPath.
• Specify the input and output formats.
• In the table properties, give the nodes to be considered for processing.

50. What are the complex types in Hive?

• ARRAY – to store elements of the same type.
• MAP – to store elements as key-value pairs, useful when you don't know the number of elements in advance.
• STRUCT – to give the data a consistent named structure.
• UNIONTYPE
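A minimal sketch showing declaration and element access (table and column names are hypothetical):

```sql
CREATE TABLE customers (
  name    STRING,
  phones  ARRAY<STRING>,
  attrs   MAP<STRING, STRING>,
  address STRUCT<city:STRING, zip:STRING>
);

SELECT name,
       phones[0],       -- first array element
       attrs['tier'],   -- map lookup by key
       address.city     -- struct field access
FROM customers;
```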

51. How do you do Sqoop-Hive integration?

• Generally: Sqoop → HDFS → (LOAD DATA INPATH) → Hive.
• $ sqoop import --connect jdbc:mysql://localhost/serviceorderdb --username root -P --table productinfo --hive-import --hive-table serviceorderdb.productinfo -m 1
• It first loads the data into an HDFS directory.
• Then it internally runs LOAD DATA INPATH from that HDFS directory into the Hive table.

52. Scenario: How do you handle an incorrect date format in Hive? E.g., you receive dates in the form dd-MM-yyyy.

• It is a two-step process:
• Load the date into a string column of a managed table.
• Reformat the string date field as needed and load it into the external table.
• select from_unixtime(unix_timestamp('2016/06/01','yyyy/MM/dd'),'yyyy-MM-dd') from table1;

53. Which compression technique is good for cold data?

• Gzip – for data accessed infrequently; provides higher compression.
• Snappy – for data accessed frequently; provides lower compression but faster (de)compression and good splittability inside container formats.

54. How do you parse the data if you have used the --enclosed-by option?

55. Does renaming a table rename its directory as well?

• Yes, in the case of a managed table.

56. How do you rename an external table's directory by renaming the table?

• We can't rename the directory just by renaming the external table.
• Convert it to a managed table, rename the table, then convert it back to an external table.
• SET TBLPROPERTIES('EXTERNAL'='TRUE');
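A sketch of the flip / rename / flip-back sequence (table names are hypothetical):

```sql
ALTER TABLE ext_sales SET TBLPROPERTIES('EXTERNAL'='FALSE');   -- now a managed table
ALTER TABLE ext_sales RENAME TO ext_sales_v2;                  -- the directory moves too
ALTER TABLE ext_sales_v2 SET TBLPROPERTIES('EXTERNAL'='TRUE'); -- back to external
```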

57. Data format comparison

• ORC – originated at Hortonworks.
• Parquet – originated at Cloudera.

58. What is vectorization?

• It enables Hive to process a batch of 1024 rows at a time rather than fetching a single record at a time.
• set hive.vectorized.execution.enabled = true;
• Example (screenshot omitted): a table with 1560 rows is processed in two batches.

59. What are the advantages of bucketing?

• Bucketed tables allow faster execution of map-side joins, as the data is stored in equal-sized buckets.
• Sampling is more efficient on bucketed tables than on non-bucketed ones.
• As with partitioning, bucketing also offers faster query performance.

60. How do you see which execution engine Hive currently uses?

• SET hive.execution.engine;

61. What is the Hive Thrift server?

• A service that permits a remote client to submit requests to Hive.
• It is also called HiveServer.
• Since it is built on Apache Thrift, it is called the Thrift server.

62. How do you make queries execute in parallel?

• SET hive.exec.parallel = true;
• This allows independent stages of a query to run in parallel.

63. If the underlying data in HDFS does not match the types of the Hive table's fields, what happens?

• There is no parse error; Hive applies the schema on read and returns NULL for values it cannot convert.

64. Is Hive suitable for OLTP?

• No. Arbitrary UPDATE/DELETE is not possible. Hive is suited for OLAP.

65. How do you view the partition information of a table?

• SHOW PARTITIONS <table_name>;

66. Can we add partition columns after the table has been created?

• No.

67. Can we use the LOAD DATA command to load data into a bucketed table?

• No. Segregating data into buckets requires computation (hashing), which is handled by the MapReduce job of an INSERT.

68. How do you run Hadoop commands from the Hive CLI?

• Prefix them with !
• hive> !hadoop fs -cat /user/hive/warehouse/employee/*

69. Explain what happens here:

sqoop import --connect jdbc:mysql://localhost/custdb --username root --password root --table customers --hive-import --hive-table retail.customer -m 1;

• Sqoop creates the directory customers under /user/hduser/. If that directory already exists, it throws an error.
• Sqoop imports the table into /user/hduser/customers.
• It creates the Hive table based on the RDBMS DDL structure if the table doesn't exist; if the table already exists, this step is skipped.
• It runs LOAD DATA INPATH from the import location (step 1) into the Hive table (warehouse) location.
• It deletes the /user/hduser/customers folder that was used for loading the data.

70. The partition folder and data exist in HDFS, but SELECT * in Hive shows no data. Why?

• The metastore is not updated. Run ALTER TABLE ... ADD PARTITION, or use MSCK REPAIR TABLE <tablename>.

71. Can we use a partition column for bucketing, i.e., bucket and partition on the same column?

• No.
• We can't bucket on a partition column, because the partition column is a virtual column; its values live in the directory names, not in the data files.

72. How many bucket files will be created, and where will they be placed, if we create 2 buckets on a table that has two partition columns?

• Two bucket files will be created inside each leaf partition directory, i.e., the bucket count applies per partition.

73. Is it good to use indexes in Hive?

• No.
• Indexing uses a separate table to hold the indexes, so it occupies extra space.
• Building indexes takes time and slows down performance while they are being built.
• Use ORC instead, which has built-in indexes.

74. How do you find the total number of managed and external tables in Hive?

• Query the metastore database: SELECT TBL_TYPE, COUNT(*) FROM TBLS GROUP BY TBL_TYPE;

75. Can we create an external table using CTAS?

• No.

76. How do you create a table's structure from another table without the data?

• CREATE TABLE new_table_name LIKE old_table_name;

77. How do you delete certain records from a Hive table that does not support transactions?

• Overwrite the table (or partition) with only the records you want to keep:
• insert overwrite table insurance partition(datadt,hr) select * from insurance where issuerid1 is not null;

78. How do you restrict a user from inserting data into a table after the first insert?

• SET TBLPROPERTIES("immutable"="true")
• The first insert, and INSERT OVERWRITE, are still allowed.

79. How do you skip header and footer rows in Hive?

• Use TBLPROPERTIES:
• skip.header.line.count
• skip.footer.line.count
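A minimal sketch that skips one header row and two trailer rows (table name, columns, and location are hypothetical):

```sql
CREATE EXTERNAL TABLE csv_feed (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/feed'
TBLPROPERTIES ('skip.header.line.count'='1',
               'skip.footer.line.count'='2');
```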

80. How do you handle a Java heap OutOfMemory issue?

• Set the container and heap sizes in Hive to resolve this error:
• set mapreduce.map.memory.mb=4096; (container physical memory)
• set mapreduce.reduce.memory.mb=4096; (container physical memory)
• set mapreduce.map.java.opts=-Xmx3686m; (Java heap memory)
• set mapreduce.reduce.java.opts=-Xmx3686m; (Java heap memory)

The Java heap is the area of memory used to store objects instantiated by applications running on the JVM. The -Xmx flag specifies the maximum memory allocation pool for a Java Virtual Machine (JVM), while -Xms specifies the initial allocation pool.

81. Scenario: If data is present in multiple subdirectories in HDFS, how do you read it into a Hive table?

set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;

82. How do you enable a fetch task instead of a MapReduce job for simple queries in Hive?

• hive.fetch.task.conversion
• hive.fetch.task.aggr
• hive.fetch.task.conversion.threshold

hive.fetch.task.conversion

• This parameter controls which kinds of simple queries can be converted to a single fetch task.
• Possible values:
• none – disable the feature.
• minimal – SELECT *, filters on partition columns (WHERE and HAVING clauses), and LIMIT only.
• more – SELECT, FILTER, LIMIT only (including TABLESAMPLE and virtual columns). "more" allows any kind of expression in the SELECT clause, including UDFs.

Cont. hive.fetch.task.conversion

• "more" also allows SELECT * with a filter to use a fetch task, e.g.:
select * from passwords where col0='root';
• Use "more" carefully and only after thorough testing, because a fetch task cannot utilize the parallelism of the MapReduce framework.

Cont. hive.fetch.task.aggr

• Aggregation queries with no GROUP BY clause (for example, select count(*) from src) execute their final aggregation in a single reduce task. If this parameter is set to true, Hive delegates the final aggregation stage to a fetch task, possibly decreasing the query time.
• This helps when the mapper output is very small, by reducing the shuffle time and the cost of launching a reduce task.

Cont. hive.fetch.task.aggr

(Screenshot omitted: performance differences.)

• Note: if the query has a GROUP BY, it cannot use this feature. For example:
select count(*) from passwords group by col0;

hive.fetch.task.conversion.threshold

• This parameter controls the input-size threshold (in bytes) for applying hive.fetch.task.conversion.
• The default threshold is 1 GB.
• The threshold is calculated or estimated from the table size, not the result-set size.

Cont. hive.fetch.task.conversion.threshold

If hive.fetch.task.conversion.threshold is less than the table size, Hive uses a MapReduce job; otherwise it uses a fetch task.

(Screenshot omitted: comparison of a 95 MB and a 572 MB table.)

83. Why do we use an RDBMS for the Hive metastore?

• Metadata access has to be fast.
• HDFS is write-once, read-many (WORM).
• HDFS is not suited to small files and upserts.

84. Why does --incremental lastmodified not work with Avro?

• Sqoop currently does not support merging Avro files (which the incremental import in lastmodified mode requires). JIRA SQOOP-1094 tracks adding such support.

85. Scenario: Does this work: insert overwrite table payments(customernumber,contactfirstname) select customernumber,contactfirstname from customer;

• No. We can't use OVERWRITE with a column list on the target table, since OVERWRITE applies to the entire table, not to specific columns.

86. How do you export a Hive table to an RDBMS?

• Sqoop export.
• With --export-dir:
sqoop export --connect jdbc:mysql://localhost/custdb --username root -P --table payments --export-dir /user/hadoopuser/payments
• With HCatalog:
sqoop export --connect jdbc:mysql://localhost/retail_db --username root -P --table mysql_test --hcatalog-table test

87. ORDER BY, SORT BY, DISTRIBUTE BY, and CLUSTER BY

Order By:
The final output is sorted using a single reducer.
If hive.mapred.mode=strict, a LIMIT clause is required with ORDER BY, since MapReduce uses only one reducer even for a huge dataset, which may slow down the process.

Sort By:
Hive uses the columns in SORT BY to sort the rows before feeding them to each reducer.

Difference between Sort By and Order By:
SORT BY sorts the data per reducer. The difference is that ORDER BY guarantees total order in the output, while SORT BY only guarantees ordering of the rows within each reducer. If there is more than one reducer, SORT BY may give a partially ordered final result.

Distribute By:
Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers. All rows with the same DISTRIBUTE BY column values go to the same reducer. However, DISTRIBUTE BY does not guarantee clustering or sorting on the distributed keys; to further sort the data, use SORT BY along with DISTRIBUTE BY. Use this when the distribute and sort fields are different.

Cluster By:
A shortcut for DISTRIBUTE BY plus SORT BY on the same columns. Use this when the distribute and sort fields are the same.
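The four clauses side by side (the txns table and its columns are hypothetical):

```sql
SELECT * FROM txns ORDER BY amount;                       -- total order, single reducer
SELECT * FROM txns SORT BY amount;                        -- sorted within each reducer only
SELECT * FROM txns DISTRIBUTE BY cust_id SORT BY amount;  -- route by cust_id, then sort per reducer
SELECT * FROM txns CLUSTER BY cust_id;                    -- = DISTRIBUTE BY cust_id SORT BY cust_id
```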

88.How do you determine number of buckets?


• Find the table size; if it is a new table, check the size of the SOR table. Say it is 2560 MB.
• HDFS block size = 128 MB
• Bucket calculation:
• 2560 / 128 = 20 (this could be chosen as the bucket count, but the buckets would fill up soon and we would need to redefine the table with a new bucket count)
• The bucket count should be a power of 2:
• 2^n >= 20, so n = log2(20) = 4.322 ≈ 5
• 2^5 = 32, so 32 is the bucket count for a table of this size.

• Whenever the table size changes, the bucket count may change. Most minor increments in table size are absorbed by the existing bucket count. But if the calculation later yields 2^6 for this example, we need to create a new table with the new bucket count, move the data from the old table to the new one (INSERT ... SELECT), and rename the table.
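Applying the calculated count of 32 in a table definition (a sketch; table and column names are hypothetical):

```sql
CREATE TABLE sales_bucketed (
  order_id BIGINT,
  cust_id  BIGINT,
  amount   DOUBLE
)
CLUSTERED BY (cust_id) INTO 32 BUCKETS
STORED AS ORC;
```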


89. If you run a SELECT * query in Hive, why does it not run MapReduce?
• It depends on the hive.fetch.task.conversion property. Possible values are none, minimal and more.
• With minimal or more, simple fetch-only queries (a plain SELECT with filters and/or LIMIT) are served directly as fetch tasks without launching MapReduce.
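For illustration, assuming a hypothetical `customers` table:

```sql
SET hive.fetch.task.conversion=more;
SELECT * FROM customers LIMIT 10;  -- served as a direct fetch task, no MR job

SET hive.fetch.task.conversion=none;
SELECT * FROM customers LIMIT 10;  -- now launches the execution engine
```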


90. Can I have a multi-column subquery in Hive?

• How to convert this for Hive: SELECT * FROM table1 A WHERE (A.id, A.name, A.roll_no) IN (SELECT id, name, roll_no FROM table2)
• Answer:
• SELECT * FROM table1 A WHERE concat(A.id, '|', A.name, '|', A.roll_no) IN (SELECT concat(id, '|', name, '|', roll_no) FROM table2)
(adding a delimiter avoids false matches when different column values concatenate to the same string)
OR
• SELECT a.* FROM table1 a INNER JOIN table2 b ON (a.id = b.id AND a.name = b.name AND a.roll_no = b.roll_no)


91. How to fix a vertex error in Hive?
• A vertex error is usually caused by a container memory issue.
• Either increase the container memory or set hive.execution.engine=mr instead of Tez.


92. How to read fixed-width data in Hive?
• Use RegexSerDe.
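A sketch of a RegexSerDe table for a hypothetical layout of id (5 chars), name (10 chars) and city (8 chars):

```sql
CREATE EXTERNAL TABLE fixed_width_demo (
  id   STRING,
  name STRING,
  city STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{5})(.{10})(.{8})")
LOCATION '/data/fixed_width';
```

Each capture group in `input.regex` maps to one column, in order.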


93. How to change a table from external to managed and vice versa?
• ALTER TABLE customers SET TBLPROPERTIES('EXTERNAL'='FALSE');  -- external to managed
• ALTER TABLE customers SET TBLPROPERTIES('EXTERNAL'='TRUE');   -- managed to external


94. How to handle Hive variables that change based on environment?
• Use hivevar or hiveconf:
• hive -hiveconf CURRENT_DATE='2020-01-10' -f /home/hduser/xyz.hql
• Or use an init file:
• hive -f /home/hduser/xyz.hql -i /home/hduser/paramfile.txt
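Inside the script, the variable is referenced with the matching namespace (file and variable names are illustrative):

```sql
-- xyz.hql, invoked as: hive --hivevar run_dt='2020-01-10' -f xyz.hql
SELECT * FROM sales WHERE data_dt = '${hivevar:run_dt}';

-- A value passed with -hiveconf is read the same way:
SELECT * FROM sales WHERE data_dt = '${hiveconf:CURRENT_DATE}';
```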


95. Is the data erased when you drop a managed or external table?
• Managed table – Yes
• External table – No


96. How to manage partitions automatically?
• ALTER TABLE customer SET TBLPROPERTIES ('partition.retention.period'='7d');


97. An external table with partitions was dropped by mistake; how do you reload it?
• Re-create the table on top of the same HDFS folder.
• Run MSCK REPAIR TABLE on it.


98. If an external tool loaded data into a Hive partition location, how do you ensure those new partitions are available for querying?
• ALTER TABLE <db_name>.<table_name> ADD PARTITION (`date`='<date_value>') LOCATION '<hdfs_location_of the specific partition>';
• MSCK REPAIR TABLE <db_name>.<table_name>;
• Automatic discovery:
• set metastore.partition.management.task.frequency=3600; (in seconds)
• ALTER TABLE csc SET TBLPROPERTIES ('discover.partitions' = 'true');


99. How to handle the small file issue in Hive?
• Use HAR (Hadoop Archive) to reduce the number of files in a partition directory:
• set hive.archive.enabled=true;
• set hive.archive.har.parentdir.settable=true;
• set har.partfile.size=1099511627776;
• Once the configuration values are set, a partition can be archived with:
• ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...);


Setting split size:

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=104857600;
set hive.merge.size.per.task=209715200;
set mapred.max.split.size=68157440;
set mapred.min.split.size=68157440;
set hive.exec.mappers.bytes.per.mapper=268435456; (how many bytes each mapper should process; the larger the value, the fewer the output files)
set hive.exec.reducers.bytes.per.reducer=268435456;

After setting the above configuration, run:
INSERT OVERWRITE TABLE table1 SELECT * FROM table1; (compacts small files loaded by NiFi or other streaming sources)


100. If we change the partition location of a Hive table using ALTER TABLE with a new location, will the data for that partition also be moved?
• No. It has to be moved manually to the new location.


101. What is the default maximum number of dynamic partitions that can be created? How can you change it?
• By default, a maximum of 100 dynamic partitions can be created per node (and 1000 in total per statement).
• We can change the per-node limit with:
• SET hive.exec.max.dynamic.partitions.pernode = <value>
• We can set the total number of dynamic partitions that one statement can create with:
• SET hive.exec.max.dynamic.partitions = <value>
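For example, raising the limits before a large dynamic-partition load (table names and values are illustrative):

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=500;

-- One partition per distinct data_dt value in the source:
INSERT OVERWRITE TABLE sales PARTITION (data_dt)
SELECT order_id, amount, data_dt FROM staging_sales;
```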


102. I dropped a Hive managed table; can I get the HDFS data back?
• Yes, from the Trash folder.
• If you don't want the data to go to the Trash folder, use the PURGE option:
• DROP TABLE [IF EXISTS] table_name [PURGE];


103. Are MINUS/EXCEPT supported in Hive?
• Not in older versions (EXCEPT DISTINCT was added in Hive 2.3). Otherwise, emulate it with a LEFT JOIN.
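The MINUS emulation with a left join looks like this (hypothetical tables `t1`, `t2`):

```sql
-- Rows of t1 that have no match in t2:
SELECT a.*
FROM t1 a
LEFT JOIN t2 b ON a.id = b.id
WHERE b.id IS NULL;
```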


104. Are subqueries supported in Hive?

• Hive supports subqueries in the FROM clause (through Hive 0.12, this was the only place they were allowed).

• Subqueries in the WHERE clause (IN/NOT IN) are supported from Hive 0.13 and must be uncorrelated.

• EXISTS and NOT EXISTS subqueries are also supported.


105. Is it possible to load data into Hive without staging it in HDFS first?
• Use LOAD DATA LOCAL INPATH.
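The file is read from the local filesystem of the machine running the Hive client (path and table are illustrative):

```sql
LOAD DATA LOCAL INPATH '/home/hduser/customers.csv'
INTO TABLE customers;
```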


106. Can Hive manage data only using HDFS as a storage layer?
• No. HDFS is the default, but you can use other storage layers such as cloud filesystems (S3), NoSQL stores, databases, etc. This is done via the respective storage handler.


107. How to change a NULL value to something else?
• Use COALESCE or NVL. ISNULL is not supported in Hive.
• COALESCE(column, CAST(0 AS BIGINT))
• NVL(T value, T default_value)
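For example (hypothetical `orders` table):

```sql
SELECT id,
       COALESCE(amount, CAST(0 AS BIGINT)) AS amount,
       NVL(city, 'UNKNOWN')                AS city
FROM orders;
```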


108. Need to load 1 year of data into a table; how to distribute it equally? Assume each day has 128 MB of data.

• set mapred.map.tasks=365;
• set mapred.reduce.tasks=365;
• insert into table abc partition(data_dt) select * from src distribute by data_dt;
DISTRIBUTE BY spreads the data equally across the 365 mappers/reducers we set, so the load is distributed uniformly. Without it, if Hive creates, say, 50 mappers, some mappers will load more data and some will load less.


109. How do you define small tables in Hive joins?
• The small table size is defined by:
• set hive.mapjoin.smalltable.filesize = 30000000;
• During joins, all left tables (except the last one) should be small enough to keep in memory; the last table is streamed during the join, enabling a map-only join.
• If we are not sure how to order the tables in the query, we can provide a STREAMTABLE hint:
• SELECT /*+ STREAMTABLE(table1) */ table1.val, table2.val
FROM table1 JOIN table2 ON (table1.key = table2.key1)


110. Difference between ORC & Parquet?

• ORC
• In Hive, vectorization is supported only for ORC, not for other file formats.
• ACID table properties are supported only with ORC (up to Hive 3.x).
• ORC supports complex types (struct, list, map, union), but its ecosystem is mostly Hive-centric.

• Parquet
• Vectorized reads for Parquet are available in Spark.
• Parquet supports complex/nested types and has wider ecosystem support (Spark, Impala, etc.).


111. When do you use what type of engine?

• MR – batch ETL/ELT loads without a strict SLA or ETA.
• Spark – in-memory, to meet faster SLAs/ETAs and for complex queries.
• Tez/Impala/Presto/Trino – in-memory, ad-hoc/real-time simple queries.


112. Joins in Hive

Reduce Side Join:
• The default join strategy, used when no optimization applies; rows are shuffled by the join key and the actual matching happens in the reducers.
• Can be optimized with the STREAMTABLE hint, which tells Hive to stream the named (largest) table instead of buffering it in the reducers.

Map Side Join:
• Optimized with the MAPJOIN hint: the small table is loaded into memory and the join completes entirely in the mappers, with no reduce phase.
• Enable auto-convert join so that Hive uses a map-side join wherever possible:
• set hive.auto.convert.join = true;
• set hive.auto.convert.join.noconditionaltask = true;
• set hive.mapjoin.smalltable.filesize = 300000;
• set hive.auto.convert.join.noconditionaltask.size = 600000;
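When auto-conversion is disabled, the hint can force the map-side join explicitly (table names are hypothetical):

```sql
SELECT /*+ MAPJOIN(d) */ f.order_id, d.cust_name
FROM fact_orders f
JOIN dim_customer d ON (f.cust_id = d.cust_id);
```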




Thank You
