Schema Design Flashcards by Erin Driggers

What are the restrictions on column family names?

Must use printable characters

Is it better to use longer or shorter column family and column names, and why?

Shorter. Each cell stored in an HFile carries both the column family name and the column name, so long names waste space.
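
A rough way to see the cost is to count the bytes of key stored per cell. The sketch below uses a deliberately simplified model (row key + family + column name + an assumed 8-byte timestamp), not the exact HFile layout; the function name and the sample names are hypothetical.

```python
def key_bytes_per_cell(row_key: str, family: str, qualifier: str) -> int:
    """Approximate bytes of key stored with every cell (simplified model)."""
    timestamp_bytes = 8  # assumption: fixed-size timestamp stored with each cell
    return (len(row_key.encode()) + len(family.encode())
            + len(qualifier.encode()) + timestamp_bytes)

cells = 1_000_000
long_names = key_bytes_per_cell("user0001", "customer_profile", "email_address")
short_names = key_bytes_per_cell("user0001", "p", "em")

# Extra bytes spent purely on names across all cells, before compression.
saved = (long_names - short_names) * cells
print(long_names, short_names, saved)  # prints: 45 19 26000000
```

Under these assumptions, shortening the names saves about 26 MB per million cells before any compression or encoding is applied.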

What is the recommended maximum number of column families?

No more than 3 column families per table.

When designing column families for data, what is recommended?

Keep data that is accessed simultaneously together

Flushing and Compaction occur per what?

Region

What triggers a minor compaction?

The number of files per column family

If one column family is large and has lots of files, will the other column families for that table also be flushed from Memstore?

Yes

The more column families, the greater the ___ load?

I/O load

What are the most common attributes on a column family?

COMPRESSION
VERSIONS
TTL
MIN_VERSIONS
BLOCKSIZE
IN_MEMORY
BLOCKCACHE
BLOOMFILTER

What are the valid values for compression? What is the default?

NONE, GZ, LZO, SNAPPY.

The default is NONE.

What are the valid values for VERSIONS? What is the default?

1+. The default is 3.

What are the valid values for MIN_VERSIONS? What is the default?

0+. The default is 0.

What are the valid values for BLOCKSIZE? What is the default?

1 byte - 2GB

The default is 64 KB.

What are the valid values for IN_MEMORY? What is the default?

true, false

The default is false

What are the valid values for BLOOMFILTER? What is the default?

NONE, ROW, ROWCOL

The default is NONE.

Is compression recommended?

Yes, for columns that do not contain already compressed data such as JPEG or PNG images.

What is the syntax for enabling compression on a column family?

alter 'table', {NAME => 'colfam', COMPRESSION => 'codec'}

What does the VERSIONS attribute specify?

How many versions of a cell to retain

What does TTL specify?

The Time to Live for a cell. Cells are automatically deleted after the specified number of seconds

What does MIN_VERSIONS specify?

The minimum number of versions of a cell to retain.

When specifying MIN_VERSIONS what else must be specified?

TTL

Are there any restrictions on the value of MIN_VERSIONS?

It must be smaller than the value of VERSIONS

What behavior results from using all three settings: VERSIONS, TTL and MIN_VERSIONS?

Keep the last T seconds' worth of data, at most N versions, but retain at least M versions.
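
The interaction of the three settings can be sketched as a small retention function. This is a simplified model for study purposes, not HBase's implementation; the function name and the use of integer-second timestamps are assumptions.

```python
def retained_versions(timestamps, now, ttl, max_versions, min_versions):
    """Return the cell versions (by timestamp) that survive, newest first.

    ttl          -> T: seconds a version stays alive
    max_versions -> N: the VERSIONS setting
    min_versions -> M: the MIN_VERSIONS setting
    """
    # Never keep more than N versions, preferring the newest.
    newest_first = sorted(timestamps, reverse=True)[:max_versions]
    kept = []
    for rank, ts in enumerate(newest_first):
        expired = now - ts > ttl
        # An expired version still survives if it is among the newest M.
        if not expired or rank < min_versions:
            kept.append(ts)
    return kept

# Two versions are still within the TTL, the third has expired.
print(retained_versions([100, 90, 50, 10, 5], now=110, ttl=30,
                        max_versions=3, min_versions=1))  # prints: [100, 90]

# Everything has expired, but MIN_VERSIONS=2 keeps the two newest.
print(retained_versions([10, 5, 1], now=110, ttl=30,
                        max_versions=3, min_versions=2))  # prints: [10, 5]
```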

What does the BLOCKSIZE specify?

The minimum amount of data read during any read request

If your workload typically consists of random reads how should you set your BLOCKSIZE?

A small value

Setting a _______ value for BLOCKSIZE generally improves performance for _____.

LARGE | SCANS

What does the BLOCKCACHE setting do?

It enables the block cache, an in-memory cache used for reads, for the column family.

What algorithm is used to evict data from the cache when it is full?

LRU (least recently used)
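
The LRU idea can be sketched in a few lines. This is a generic illustration, not HBase's block cache implementation; the class name and the block keys are hypothetical.

```python
from collections import OrderedDict

class LruCache:
    """Minimal LRU cache: the least recently used entry is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self.blocks:
            return None  # cache miss
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key, value):
        self.blocks[key] = value
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

cache = LruCache(capacity=2)
cache.put("block-a", "data a")
cache.put("block-b", "data b")
cache.get("block-a")            # block-a is now the most recently used
cache.put("block-c", "data c")  # cache is full, so block-b is evicted
print(cache.get("block-b"))     # prints: None
```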

If you set the BLOCKCACHE to false what does that do?

It ensures that the data read during a scan or get is not stored in block cache.

Why would you set BLOCKCACHE to false?

To avoid polluting the block cache with data from seldom-selected column families.

What does setting the IN_MEMORY setting to true do?

Ensures that the data from the Column Family is only evicted from the Block Cache when absolutely necessary

What is a Bloom Filter?

It is a data structure which allows an existence check for a particular piece of data

Can a Bloom filter say that a piece of data definitely exists?

No. It can only indicate that the data may exist, or that it definitely does not exist.
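
A toy Bloom filter makes the "maybe / definitely not" behavior concrete. This is a generic sketch, not the structure HBase stores; the class name, the bit-array size, and the SHA-256 based hashing are assumptions chosen for readability.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True only means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bloom = BloomFilter()
print(bloom.might_contain("row-42"))  # prints: False (nothing added yet)
bloom.add("row-42")
print(bloom.might_contain("row-42"))  # prints: True
```

A `True` result for a key that was never added is possible when other keys happen to set the same bits, which is why a positive answer still requires reading the file.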

How do Bloom filters improve read performance?

- Eliminates the need to read every store file
- Allows the RegionServer to skip files that do not contain the row or row and column

How do you enable a Bloom filter?

alter 'table', {NAME => 'colfam', BLOOMFILTER => 'ROW'}

Where are bloom filters kept?

In the store file

When should you not use bloom filters?

When all the rows are updated regularly and the rows are spread across most store files

When should you use bloom filters?

- Access patterns with lots of misses during reads
- Speed up reads by cutting down on the store file reads
- Update data in batches so that rows are in fewer store files

Joins are costly in HBase. How should you design your schema to get around this limitation?

- Avoid where possible
- Data should be denormalized
- Code must be written to update denormalized data everywhere

What are the two approaches to table layouts, and what access pattern is recommended for each?

Tall and Narrow - Scans
Flat and Wide - Gets

Do tall-narrow and flat-wide tables have the same storage footprint?

Yes

What are the pluses and minuses of using tall and narrow tables?

The minus is that you have less atomicity because the data is spread across more rows. The plus is that you can use partial key scans to reconstruct all the rows.
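
A partial key scan relies on row keys being stored in sorted order, so all rows sharing a key prefix are adjacent. The sketch below simulates that over a sorted Python list; the `user|sequence` key layout and the function name are hypothetical.

```python
from bisect import bisect_left

def prefix_scan(sorted_row_keys, prefix):
    """Return all row keys starting with prefix from a sorted key list."""
    # Jump to the first key that is >= prefix, then read until the prefix ends.
    start = bisect_left(sorted_row_keys, prefix)
    result = []
    for key in sorted_row_keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result

# Tall-narrow layout: one row per (user, message) instead of one wide row per user.
rows = sorted(["alice|001", "alice|002", "bob|001", "alice|003", "carol|001"])
print(prefix_scan(rows, "alice|"))
# prints: ['alice|001', 'alice|002', 'alice|003']
```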

Since HBase does not support secondary indexes, what are your options?

- Run a filter query using the API
- Create a secondary index using a MapReduce job
- Create summary tables

What causes hotspotting?

It occurs when a small number of region servers handle most of the load.

What are some examples of what can cause hotspotting?

- An automatic or pre-split region is not optimal
- The row key is sequential or time series
- The regions for a table are not distributed around a cluster efficiently
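
One common way to spread sequential or time-series row keys is to prefix each key with a small hash-derived "salt". The sketch below is illustrative only; the function name, the bucket count, and the choice of SHA-256 are assumptions.

```python
import hashlib

def salted_key(row_key: str, buckets: int = 4) -> str:
    """Prefix the row key with a deterministic bucket number."""
    digest = hashlib.sha256(row_key.encode()).digest()
    salt = digest[0] % buckets  # same key always maps to the same bucket
    return f"{salt}-{row_key}"

# Sequential timestamps no longer sort next to each other.
for second in range(3):
    print(salted_key(f"2024-01-01T00:00:{second:02d}"))
```

The trade-off is on the read side: a range scan over the original key order has to be issued once per bucket and the results merged.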