Schema Design Flashcards

1
Q

What are the restrictions on column family names?

A

Must use printable characters

2
Q

Is it better to use longer or shorter column family and column names, and why?

A

Shorter. Every cell stored in an HFile carries both the column family name and the column name, so long names waste space.
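As a rough illustration of the space cost, the sketch below counts only the bytes spent repeating the family and column names across cells. The function name and the example names are hypothetical, and the estimate ignores compression and data block encoding, which can reduce the real on-disk cost.

```python
def name_overhead_bytes(family: str, qualifier: str, cells: int) -> int:
    """Bytes spent repeating the family and qualifier names across `cells` cells."""
    per_cell = len(family.encode("utf-8")) + len(qualifier.encode("utf-8"))
    return per_cell * cells


# Hypothetical names, one million cells each.
long_names = name_overhead_bytes("customer_details", "email_address", 1_000_000)
short_names = name_overhead_bytes("d", "em", 1_000_000)
print(long_names, short_names)
```

Under these assumptions the long names cost roughly ten times as many bytes as the short ones before any compression is applied.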

3
Q

What is the recommended maximum number of column families?

A

No more than 3 column families per table.

4
Q

When designing column families for data what is recommended?

A

Keep data that is accessed together in the same column family.

5
Q

Flushing and Compaction occur per what?

A

Region

6
Q

What triggers a minor compaction?

A

The number of store files in a column family reaching a configured threshold

7
Q

If one column family is large and has lots of files, will the other column families for that table also be flushed from the MemStore?

A

Yes

8
Q

The more column families, the greater the ___ load?

A

I/O load

9
Q

What are the most common attributes on a column family?

A
COMPRESSION
VERSIONS
TTL
MIN_VERSIONS
BLOCKSIZE
IN_MEMORY
BLOCKCACHE
BLOOMFILTER
10
Q

What are the valid values for compression? What is the default?

A

NONE, GZ, LZO, SNAPPY.

The default is NONE.

11
Q

What are the valid values for VERSIONS? What is the default?

A

1 or more. The default is 3 in older HBase releases; since HBase 0.96 the default is 1.

12
Q

What are the valid values for MIN_VERSIONS? What is the default?

A

0+. The default is 0.

13
Q

What are the valid values for BLOCKSIZE? What is the default?

A

1 byte - 2GB

The default is 64 KB.

14
Q

What are the valid values for IN_MEMORY? What is the default?

A

true, false

The default is false

15
Q

What are the valid values for BLOOMFILTER? What is the default?

A

NONE, ROW, ROWCOL

The default is NONE in older HBase releases; since HBase 0.96 the default is ROW.

16
Q

Is compression recommended?

A

Yes, for column families that do not hold already-compressed data such as JPEG or PNG images.

17
Q

What is the syntax for enabling compression on a column family?

A

alter 'table', {NAME => 'colfam', COMPRESSION => 'codec'}

18
Q

What does the VERSIONS attribute specify?

A

The maximum number of versions of a cell to retain

19
Q

What does TTL specify?

A

The Time to Live for a cell. Cells are automatically deleted after the specified number of seconds

20
Q

What does MIN_VERSIONS specify?

A

The minimum number of versions of a cell to retain.

21
Q

When specifying MIN_VERSIONS what else must be specified?

A

TTL

22
Q

Are there any restrictions on the value of MIN_VERSIONS?

A

It must be smaller than the value of VERSIONS

23
Q

What behavior results from using all three of the VERSIONS, TTL and MIN_VERSIONS settings?

A

Keep the last T seconds' worth of data, at most N versions, but retain at least M versions.
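The interaction of the three settings can be sketched as a small retention function. This is a simplified model for study purposes, not HBase's actual implementation; the function and parameter names are hypothetical, with `ttl` standing for T, `max_versions` for N (VERSIONS) and `min_versions` for M (MIN_VERSIONS).

```python
def retained_versions(timestamps, now, ttl, max_versions, min_versions):
    """Return the cell timestamps that survive, newest first."""
    kept = []
    for ts in sorted(timestamps, reverse=True):
        if len(kept) >= max_versions:
            break  # never keep more than VERSIONS
        expired = (now - ts) > ttl
        # Expired versions survive only while fewer than MIN_VERSIONS are kept.
        if not expired or len(kept) < min_versions:
            kept.append(ts)
    return kept


versions = [100, 90, 50, 10]
print(retained_versions(versions, now=100, ttl=20, max_versions=3, min_versions=1))
print(retained_versions(versions, now=200, ttl=20, max_versions=3, min_versions=1))
```

In the second call every version is older than the TTL, yet the newest one is still kept because MIN_VERSIONS is 1.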

24
Q

What does the BLOCKSIZE specify?

A

The minimum amount of data read during any read request

25
Q

If your workload typically consists of random reads how should you set your BLOCKSIZE?

A

A small value

26
Q

Setting a _______ value for BLOCKSIZE generally improves performance for _____.

A

LARGE

SCANS
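The trade-off behind the last two cards can be sketched with simple arithmetic. The model below assumes fixed-size blocks and a cell that starts on a block boundary, which is a simplification (in HBase the block size is a target rather than an exact size); the function names are hypothetical.

```python
def blocks_in_file(file_bytes: int, block_bytes: int) -> int:
    """Number of blocks (and so block-index entries) needed to cover a file."""
    return -(-file_bytes // block_bytes)  # integer ceiling division


def bytes_read_for_get(cell_bytes: int, block_bytes: int) -> int:
    """Bytes fetched to read one cell, since whole blocks are read."""
    return blocks_in_file(cell_bytes, block_bytes) * block_bytes


ONE_GIB = 1024 ** 3
for block in (8 * 1024, 64 * 1024):
    print(block, blocks_in_file(ONE_GIB, block), bytes_read_for_get(100, block))
```

A smaller block means less data fetched per random get but more blocks to index; a larger block means fewer, bigger reads, which suits sequential scans.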

27
Q

What does the BLOCKCACHE setting do?

A

It is an in-memory cache used for reads

28
Q

What algorithm is used to evict data from the cache when it is full?

A

LRU (least recently used)
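LRU eviction can be sketched in a few lines. This is a minimal single-tier model written for illustration, not the RegionServer's block cache; the class name and capacity are hypothetical.

```python
from collections import OrderedDict


class LruCacheSketch:
    """Evicts the least recently used block once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)  # a read marks the block as recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # drop the least recently used


cache = LruCacheSketch(capacity=2)
cache.put("block-a", b"...")
cache.put("block-b", b"...")
cache.get("block-a")          # block-a is now the most recently used
cache.put("block-c", b"...")  # evicts block-b
print(list(cache.blocks))
```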

29
Q

If you set the BLOCKCACHE to false what does that do?

A

It ensures that the data read during a scan or get is not stored in block cache.

30
Q

Why would you set BLOCKCACHE to false?

A

To avoid polluting the block cache with data from seldom-accessed column families.

31
Q

What does setting the IN_MEMORY setting to true do?

A

Ensures that the data from the Column Family is only evicted from the Block Cache when absolutely necessary

32
Q

What is a Bloom Filter?

A

It is a data structure which allows an existence check for a particular piece of data

33
Q

Can a Bloom filter say that a piece of data definitely exists?

A

No. It can only indicate that the data may exist, or that it definitely does not exist.

34
Q

How do Bloom filters improve read performance?

A
  • Eliminates the need to read every store file
  • Allows the RegionServer to skip files that do not contain the row, or the row and column
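The file-skipping idea can be sketched with a toy Bloom filter. This is an illustrative model, not HBase's implementation: the class, the bit-array size, the hash scheme and the `files_to_read` helper are all assumptions made for the example.

```python
import hashlib


class BloomFilterSketch:
    """Answers "definitely absent" or "possibly present" for a key."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: bytes):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


def files_to_read(store_files: dict, row_key: bytes) -> list:
    """Keep only the store files whose filter cannot rule the row out."""
    return [name for name, bloom in store_files.items()
            if bloom.might_contain(row_key)]


file_1, file_2 = BloomFilterSketch(), BloomFilterSketch()
file_1.add(b"row-0001")  # file_2 holds no rows in this example
print(files_to_read({"file_1": file_1, "file_2": file_2}, b"row-0001"))
```

A filter never gives a false negative, so skipping a file it rules out is always safe; a false positive only costs one unnecessary file read.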

35
Q

How do you enable a Bloom filter?

A

alter 'table', {NAME => 'colfam', BLOOMFILTER => 'ROW'}

36
Q

Where are bloom filters kept?

A

In the store file

37
Q

When should you not use bloom filters?

A

When all the rows are updated regularly and the rows are spread across most store files

38
Q

When should you use bloom filters?

A
  • Access patterns with lots of misses during reads
  • Speed up reads by cutting down on the store file reads
  • Update data in batches so that rows are in fewer store files
39
Q

Joins are costly in HBase. How should you design your schema to get around this limitation?

A
  • Avoid joins where possible
  • Data should be denormalized
  • Code must be written to update denormalized data everywhere
40
Q

What are the two approaches to table layouts, and what access pattern is recommended for each?

A

Tall and Narrow - Scans

Flat and Wide - Gets

41
Q

Do tall - narrow and flat - wide tables use the same footprint?

A

Yes

42
Q

What are the pluses and minuses of using tall and narrow tables?

A

The minus is that you have less atomicity because the data is spread across more rows.

The plus is that you can use partial key scans to reconstruct all the rows.
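A partial key scan over a tall and narrow layout can be sketched with a sorted list standing in for the table. The composite key format (`user-messageId`) and the helper name are hypothetical; the point is only that rows sharing a key prefix sit next to each other in sorted order.

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Return every row key that starts with `prefix`, using the sort order."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


# Tall and narrow: one row per (user, message) pair.
row_keys = sorted([
    "user1-0001",
    "user1-0002",
    "user10-0001",
    "user2-0001",
])
print(prefix_scan(row_keys, "user1-"))
```

Including the separator in the prefix keeps `user10` rows out of a scan for `user1`.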

43
Q

Since HBase does not support secondary indexes, what are your options?

A
  • Run a filter query using the API
  • Create a secondary index using a MapReduce job
  • Create summary tables
44
Q

What causes Hotspotting?

A

A small number of RegionServers handling most of the load.

45
Q

Give some examples of what can cause Hotspotting?

A
  • An automatic or pre-split region is not optimal
  • The row key is sequential or time series
  • The regions for a table are not distributed around a cluster efficiently