Hive - interview questions Flashcards
Q: Why do we need Hive?
A: Hive is a tool in the Hadoop ecosystem that provides a database-like interface for organizing and querying data with SQL-like queries (HiveQL). It is suitable for accessing and analyzing data stored in Hadoop using SQL syntax.
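Q: Is Hive suitable for OLTP systems? Why?
A: No. Hive is built for batch-oriented analytical queries with high latency, not for many small, fast transactions. Historically it did not provide row-level insert and update at all; since Hive 0.14 these are available on transactional (ACID) tables, but they are still not intended for OLTP workloads.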
Q: What is SerDe in Apache Hive?
A: SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization, and it also interprets the serialized data as individual fields for processing.
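As an illustration, a table can name the SerDe it should be read and written with. The sketch below assumes Hive 0.14 or later, where `OpenCSVSerde` is bundled; the table and column names are hypothetical.

```sql
-- Hypothetical table read through the CSV SerDe bundled with Hive 0.14+.
-- OpenCSVSerde treats every column as STRING, so the columns are declared that way.
CREATE TABLE csv_events (
  event_id STRING,
  payload  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE;
```

On read, the SerDe deserializes each line into the two fields; on write, it serializes the fields back into a CSV line.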
Q: What is Hive Metastore?
A: The Hive metastore is a database that stores metadata about your Hive tables (e.g. table name, column names and types, table location, storage handler in use, number of buckets, sorting columns if any, partition columns if any).
When you create a table, the metastore is updated with the information about the new table, and that information is read back whenever you run queries on the table.
The metastore is the central repository of Hive metadata. It has two parts: a service and the backing data store. By default, it uses a Derby database on local disk, which is referred to as the embedded metastore configuration. This configuration has the limitation that only one session can be served at any given time.
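The metadata described above can be inspected from HiveQL. A minimal sketch, assuming a table named `employee` already exists in the current database:

```sql
-- List the tables registered in the current database.
SHOW TABLES;

-- Print the metadata kept for one table: column names and types, location,
-- table type (managed or external), SerDe, bucketing and partition details.
DESCRIBE FORMATTED employee;
```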
Q: Which binary storage formats does Hive support?
A: Hive natively supports the text file format, but it also supports several binary formats: Sequence files, Avro data files, and RCFiles.
Sequence files: a general-purpose binary format that is splittable, compressible, and row-oriented. A typical use case: if we have many small files, we can use a sequence file as a container, with the filename as the key and the file content as the value. It supports compression, which can yield a large gain in performance.
Avro data files: like sequence files, they are splittable, compressible, and row-oriented, and they additionally support schema evolution and bindings for multiple languages.
RCFiles: Record Columnar File, a column-oriented storage format. It breaks the table into row splits. Within each split, it stores the values of the first column for all rows, followed by the values of the second column, and so on.
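The storage format is chosen per table with the `STORED AS` clause. The table names below are hypothetical, and `STORED AS AVRO` assumes Hive 0.14 or later.

```sql
-- Row-oriented binary container of key/value records.
CREATE TABLE logs_seq (id INT, msg STRING) STORED AS SEQUENCEFILE;

-- Row-oriented, with the schema stored alongside the data (Hive 0.14+).
CREATE TABLE logs_avro (id INT, msg STRING) STORED AS AVRO;

-- Column-oriented within each row split.
CREATE TABLE logs_rc (id INT, msg STRING) STORED AS RCFILE;
```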
Q: What is the difference between an external table and a managed table?
A:
Managed (or internal) table
Dropping a managed table deletes both the metadata and the table data from the Hive warehouse directory.
External table
Dropping an external table deletes only the metadata for the table. The table data in HDFS is left untouched.
Q: Is it possible to change the default location of a managed table?
A: Yes, by specifying the LOCATION clause when creating the table.
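A minimal sketch of both table types and of what DROP TABLE does to each. The table names and HDFS paths are hypothetical, and some newer Hive deployments restrict where managed tables may live.

```sql
-- Managed table placed outside the default warehouse directory via LOCATION.
CREATE TABLE sales_managed (id INT, amount DOUBLE)
LOCATION '/data/hive/sales_managed';

-- External table defined over files that already exist in HDFS.
CREATE EXTERNAL TABLE sales_external (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/sales';

DROP TABLE sales_managed;   -- removes the metadata and the table data
DROP TABLE sales_external;  -- removes the metadata only; files in /data/raw/sales remain
```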
Q: What is a partition in Hive?
A: Hive organizes a table into partitions in order to group similar data together on the basis of a column, the partition key. A table can have one or more partition keys, which identify a particular partition. Physically, each partition is a sub-directory in the table directory.
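A small sketch of a partitioned table, with hypothetical table, column, and partition values:

```sql
-- Partition keys are declared separately from the regular columns.
CREATE TABLE page_views (user_id INT, url STRING)
PARTITIONED BY (view_date STRING, country STRING);

-- Adds one partition; under the table directory this becomes the
-- sub-directory view_date=2024-01-01/country=US.
ALTER TABLE page_views ADD PARTITION (view_date = '2024-01-01', country = 'US');

-- Lists the partitions known to the metastore.
SHOW PARTITIONS page_views;

-- A filter on partition keys lets Hive read only the matching sub-directories.
SELECT COUNT(*) FROM page_views
WHERE view_date = '2024-01-01' AND country = 'US';
```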
Q: What is dynamic partitioning and when is it used?
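A: In dynamic partitioning, the values of the partition columns are known only at runtime, that is, while the data is being loaded into the Hive table. It is typically used when loading with INSERT ... SELECT from another table, where the set of partition values is large or not known in advance.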
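A minimal sketch of a dynamic-partition load. The target table is created here for completeness; `orders_staging` is a hypothetical, unpartitioned source table with columns `order_id`, `amount`, and `country`.

```sql
-- Target table, partitioned by country.
CREATE TABLE orders_by_country (order_id INT, amount DOUBLE)
PARTITIONED BY (country STRING);

-- Enable dynamic partitioning; nonstrict mode allows every partition key to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- No value is given for country in the PARTITION clause. Hive takes it from the
-- last column of the SELECT and creates one partition per distinct value.
INSERT OVERWRITE TABLE orders_by_country PARTITION (country)
SELECT order_id, amount, country
FROM orders_staging;
```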
Q: How do you skip header rows in a Hive table?
A: Suppose the log files behind a table begin with three header lines that we do not want to include in our Hive queries. To skip them, set the skip.header.line.count table property to the number of header lines:
CREATE EXTERNAL TABLE employee (name STRING, job STRING, dob STRING, id INT, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/user/data'
TBLPROPERTIES ("skip.header.line.count"="3");
Q: What is the maximum size of a string data type supported by Hive?
A: The maximum size of a string data type supported by Hive is 2 GB.