Hive Flashcards
How do you create a table partitioned on a few columns?
CREATE TABLE dbname.table_name (
id INT, name STRING, city STRING)
PARTITIONED BY (state STRING, month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
Note that partition columns are declared only in the PARTITIONED BY clause; repeating them in the regular column list is an error.
How do you load data into a table from HDFS or the local filesystem?
LOAD DATA [LOCAL] INPATH '/path/to/file.csv' INTO TABLE <table_name>
[PARTITION (partcol1 = val1, partcol2 = val2, ...)]
[INPUTFORMAT 'inputformat' SERDE 'serde'];
With LOCAL, the path is read from the local filesystem; without it, the path is resolved on HDFS.
What is the difference between ORC and Parquet?
Similarity: both are columnar file formats.
Differences:
1. ORC is best suited to Hive.
2. Parquet is best suited to Spark.
3. In terms of storage, ORC takes less space than Parquet.
4. For nested data, Parquet is the better choice.
5. Searching is faster in ORC because it maintains built-in indexes.
6. In terms of compatibility with other platforms, Parquet is better than ORC.
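For reference, the format is chosen per table at creation time. A minimal sketch (the table and column names are placeholders):
CREATE TABLE events_orc (id INT, payload STRING) STORED AS ORC;
CREATE TABLE events_parquet (id INT, payload STRING) STORED AS PARQUET;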
What makes Parquet faster than CSV?
Parquet is often faster than CSV for data processing due to the following key factors:
Columnar Storage: Parquet stores data in a columnar format, which allows for more efficient compression and encoding of data. This means that when you need to read only specific columns, Parquet can read and decode only the necessary data, resulting in significantly faster query performance compared to CSV.
Compression: Parquet uses column-level compression techniques that can lead to much smaller file sizes. Smaller file sizes mean less data to read from storage, reducing I/O overhead and speeding up data access.
Predicate Pushdown: Many modern data processing engines, like Apache Spark and Apache Hive, can leverage Parquet’s metadata to skip reading irrelevant data blocks based on query predicates. This further reduces the amount of data to be read from disk.
Schema Evolution: Parquet supports complex data types and schema evolution, which is beneficial in big data scenarios where data schemas can change over time. This flexibility can lead to more efficient data storage and access.
Parallel Processing: Parquet files can be split into smaller row groups, which allows for parallel reading and processing, especially when used with distributed data processing frameworks like Hadoop.
Type Encoding: Parquet uses efficient type encoding, such as dictionary encoding for string columns, which can significantly reduce storage space and improve query performance.
Compression Algorithms: Parquet allows for different compression algorithms to be used, enabling users to choose the one that best suits their data and use case.
In summary, Parquet’s columnar storage, efficient compression, and compatibility with modern data processing engines make it faster and more efficient than CSV for many big data processing tasks.
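To see the effect in Hive, a text-backed table can be rewritten as Parquet with a CTAS statement. A rough sketch, where sales_csv, sales_parquet, and the column names are hypothetical placeholders:
CREATE TABLE sales_parquet STORED AS PARQUET AS
SELECT * FROM sales_csv;
SELECT state, SUM(amount) FROM sales_parquet GROUP BY state;
A query like the last one reads only the state and amount columns from the Parquet files, whereas the same query against the CSV-backed table has to scan every field of every row.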
Bucketing vs. Partitioning
Hive Partitioning: a way to organize a large table into smaller logical tables based on column values, with one logical table (partition) per distinct value. In Hive, a table is a directory on HDFS, and each partition is a subdirectory of it.
Hive Bucketing (clustering): a technique to split the data into a fixed number of more manageable files by specifying the number of buckets to create; rows are assigned to buckets by hashing the bucketing column.
Syntax:
CREATE TABLE your_table_name (
column1 data_type,
column2 data_type,
...
)
PARTITIONED BY (partition_column data_type)
CLUSTERED BY (bucketed_column) INTO num_buckets BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');
-- Example of how to insert data into the table with partition and bucket information
-- Replace values with actual data
INSERT INTO your_table_name PARTITION (partition_column='value')
SELECT column1, column2, ...
FROM your_source_table;
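Note that LOAD DATA does not redistribute rows into buckets, so bucketed tables should be populated with INSERT ... SELECT as above. On Hive versions before 2.0, bucketing enforcement also had to be switched on first:
SET hive.enforce.bucketing = true;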
How do you convert an internal (managed) table to an external table?
ALTER TABLE inner_table SET TBLPROPERTIES ('EXTERNAL'='TRUE');
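You can verify the change with DESCRIBE FORMATTED inner_table; the output should now report Table Type: EXTERNAL_TABLE.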
What is vectorization in Hive?
Vectorization allows Hive to process a batch of rows together instead of processing rows one by one.
Vectorized query execution streamlines operations by processing a block of 1024 rows at a time.
To enable vectorization, set the configuration:
hive.vectorized.execution.enabled = true
Currently, only single-table, read-only queries are supported.
Supported operators: selection, filter, and group-by.
Partitioned tables are supported.
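On recent Hive releases (EXPLAIN VECTORIZATION was added around Hive 2.3) you can check whether a query actually vectorizes; a sketch against a hypothetical table:
SET hive.vectorized.execution.enabled = true;
EXPLAIN VECTORIZATION SUMMARY
SELECT state, COUNT(*) FROM table_name GROUP BY state;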
Data warehouse design
What is the basic Spark command to read a file and store the data in a table?
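A minimal PySpark sketch (the path, CSV options, and table name below are placeholders):
from pyspark.sql import SparkSession

# Hive support lets saveAsTable write through the Hive metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read a delimited file into a DataFrame
df = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)

# Persist the DataFrame as a table
df.write.mode("overwrite").saveAsTable("dbname.table_name")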