FinalExamHadoop Flashcards

1
Q

What are the three areas of expertise you need to be a "unicorn" data scientist?

A
Hacking skills (programming; getting, manipulating, and exploring data)
Math and stats (foundational statistics, internals of algorithms, practical knowledge)
Domain knowledge (understanding the business problem, understanding the data)
2
Q

What is the truth about learning?

A

When you stop learning your salary stops growing.

3
Q

What are the last two V's of big data?

A

Value (statistical, hypothetical)

Veracity (trustworthiness)

4
Q

What is a winning interview question for you to ask the employer?

A

Do you have a centralized or decentralized data model?

5
Q

What is the basis of machine learning?

A

Extracting knowledge from data.

It sits at the intersection of statistics, AI, and computer science.

6
Q

What is supervised learning?

A

Automating decision-making processes by generalizing from known examples.
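
As a rough illustration of "generalizing from known examples", here is a minimal nearest-centroid sketch. The measurements, the labels, and the helper names (`fit_centroids`, `predict`) are all hypothetical; this is one simple way to learn from labeled data, not a method taken from the course.

```python
from collections import defaultdict

# Hypothetical labeled examples: (measurement, known answer).
training = [(1.0, "small"), (1.2, "small"), (5.0, "large"), (5.3, "large")]


def fit_centroids(examples):
    """Average the measurements seen for each label."""
    grouped = defaultdict(list)
    for value, label in examples:
        grouped[label].append(value)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}


def predict(centroids, value):
    """Pick the label whose centroid is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


centroids = fit_centroids(training)
print(predict(centroids, 1.1))  # small
print(predict(centroids, 4.0))  # large
```

The model never saw 1.1 or 4.0; it answers by generalizing from the labeled examples it was given.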

7
Q

What is unsupervised learning?

A

Learning from data alone: you have zero answers going in. It is harder to understand and evaluate than supervised learning.
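
To make "zero answers going in" concrete, here is a small sketch that groups unlabeled 1-D values into two clusters. It is a simplified two-center version of k-means written for illustration; the data and the `two_means` helper are assumptions, not part of the course material.

```python
def two_means(values, rounds=10):
    """Split 1-D values into two groups around two moving centers."""
    centers = [min(values), max(values)]  # simple starting guess
    clusters = [[], []]
    for _ in range(rounds):
        clusters = [[], []]
        for v in values:
            # Assign each value to whichever center is closer.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            sum(group) / len(group) if group else centers[i]
            for i, group in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical unlabeled data: no "small"/"large" answers are provided.
data = [1.0, 5.0, 1.2, 5.3, 0.8, 4.9]
centers, clusters = two_means(data)
print(clusters)  # [[1.0, 1.2, 0.8], [5.0, 5.3, 4.9]]
```

Nothing tells the code what the groups mean, which is part of why the result is harder to evaluate than a supervised prediction.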

8
Q

What is overfitting?

A

Fitting a model too closely to the particularities of the training set, so that it works well on the training data but not on new data.

9
Q

What is underfitting?

A

Making your model too simple to capture the pattern in the data, so it performs poorly even on the training set.
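
The overfitting and underfitting cards can be contrasted with a toy sketch. The data is invented (the true rule is assumed to be y = 2x, with alternating noise added to the training set), and the three models are illustrative stand-ins: a constant (too simple), a least-squares line, and a memorizer that copies the nearest training answer (too close to the training set).

```python
def mse(model, points):
    """Mean squared error of a model over (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


# Hypothetical data: true rule y = 2x, training noise alternates +3 / -3.
train = [(float(i), 2.0 * i + (3.0 if i % 2 == 0 else -3.0)) for i in range(10)]
test = [(i + 0.5, 2.0 * (i + 0.5)) for i in range(9)]  # new, unseen inputs

mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)
slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sum(
    (x - mean_x) ** 2 for x, _ in train
)
intercept = mean_y - slope * mean_x


def constant_model(x):
    """Underfit: ignores x and always predicts the training mean."""
    return mean_y


def line_model(x):
    """Reasonable fit: least-squares straight line."""
    return slope * x + intercept


def memorizer(x):
    """Overfit: returns the answer of the nearest training point, noise included."""
    _, nearest_y = min(train, key=lambda point: abs(point[0] - x))
    return nearest_y


for name, model in [("constant", constant_model), ("line", line_model), ("memorizer", memorizer)]:
    print(f"{name:10s} train MSE={mse(model, train):7.2f}  test MSE={mse(model, test):7.2f}")
```

The memorizer has zero training error but does worse than the line on new data, while the constant model is poor on both.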

10
Q

What are Hive and Impala?

A

SQL-like query tools for Hadoop. Impala is very fast, whereas Hive is a little more robust.

11
Q

What is a SELECT statement in Hadoop?

A

map

12
Q

What is a GROUP BY statement in Hadoop?

A

reduce by key

13
Q

What is a WHERE clause in Hadoop?

A

filter
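
The three SQL cards above (SELECT as map, GROUP BY as reduce by key, WHERE as filter) can be sketched in plain Python. The `rows` table and the query in the comment are made up for illustration, and Python built-ins stand in for the Hadoop or Spark operations.

```python
from collections import defaultdict

# Hypothetical table. The SQL being imitated:
#   SELECT dept, SUM(salary) FROM rows WHERE salary > 50 GROUP BY dept
rows = [
    {"dept": "ops", "salary": 40},
    {"dept": "ops", "salary": 70},
    {"dept": "eng", "salary": 90},
    {"dept": "eng", "salary": 60},
]

# WHERE salary > 50  ->  filter
kept = filter(lambda r: r["salary"] > 50, rows)

# SELECT dept, salary  ->  map (emit a key-value pair per row)
pairs = map(lambda r: (r["dept"], r["salary"]), kept)

# GROUP BY dept with SUM  ->  reduce by key
sums = defaultdict(int)
for dept, salary in pairs:
    sums[dept] += salary
totals = dict(sums)

print(totals)  # {'ops': 70, 'eng': 150}
```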

14
Q

What are the benefits of Spark?

A

Optimization, lazy loading/evaluation, speed, efficiency
Blazing fast because it keeps data in memory until an action is needed
Fault tolerance: you can go back and rerun a portion
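
Lazy evaluation can be sketched without Spark. The `LazyDataset` class below is a toy written for this card, not Spark's API: `map` and `filter` only record steps, and nothing runs until the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's API)."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)  # recorded transformations, not yet run

    def map(self, fn):
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.source, self.steps + (("filter", fn),))

    def collect(self):
        """Action: replay every recorded step over the source."""
        items = iter(self.source)
        for kind, fn in self.steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def double(x):
    calls.append(x)  # record that real work happened
    return x * 2


plan = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # 0: only the plan exists so far
result = plan.collect()           # the action triggers the work
print(calls_before_action, result)  # 0 [6, 8]

# Because the plan is just a recipe, it can be replayed to rebuild a result.
rerun = plan.collect()
```

Keeping the recipe rather than the result is also what makes rerunning a portion possible in this sketch.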

15
Q

What is an RDD?

A

A Resilient Distributed Dataset. Think of each RDD as a step on a map of the work to be done on the data.

16
Q

What is a DAG?

A

A directed acyclic graph. It describes how all your RDDs work together, forming a complete map of the transformations you are going to make to the data.
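
One way to picture the DAG is as a dependency map from each RDD to the RDDs it is derived from. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) on a hypothetical word-count lineage; the RDD names are invented, and this shows only the ordering idea, not how Spark schedules work internally.

```python
from graphlib import TopologicalSorter

# Each hypothetical RDD maps to the set of RDDs it is built from.
lineage = {
    "lines": set(),
    "words": {"lines"},       # split each line into words
    "pairs": {"words"},       # turn each word into (word, 1)
    "counts": {"pairs"},      # reduce by key
    "stopwords": set(),
    "filtered": {"counts", "stopwords"},  # drop stopwords from the counts
}

# A valid execution order always places parents before their children.
order = list(TopologicalSorter(lineage).static_order())
print(order)
```

Because the graph has no cycles, such an order always exists; `TopologicalSorter` raises `CycleError` if a cycle is introduced.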

17
Q

What are key-value pairs?

A

An array of entries, each with a key (which can change) and a value. Know how this differs from a unique identifier (a primary key in SQL), why key-value pairs matter, and why they get their own command set.
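
A small sketch of the contrast with a primary key, using invented data. It assumes the usual behavior of key-value datasets, where the same key may appear on many records, which is why per-key operations such as grouping are needed.

```python
from collections import defaultdict

# A primary key is unique: exactly one row per id.
customers = {101: "Ada", 102: "Grace"}

# Key-value pairs may repeat a key across many records.
purchases = [("Ada", 30), ("Grace", 12), ("Ada", 5)]

# Per-key commands exist because values must be gathered by key first.
by_key = defaultdict(list)
for key, value in purchases:
    by_key[key].append(value)
grouped = dict(by_key)

print(grouped)  # {'Ada': [30, 5], 'Grace': [12]}
```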

18
Q

Why does a simple reduce not work with aggregates across partitions in Hadoop?

A

Reduce combines rows two at a time, then moves on and combines the result with the next row. It reduces everything within each node first, not over the whole dataset, and only then combines the per-node results, which can lead to inaccurate results.
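
The effect can be reproduced in plain Python with `functools.reduce`. The sketch below assumes a pairwise "average" function as the aggregate and two made-up partitionings of the same six numbers; the `reduce_partitions` helper is a stand-in for reducing within each node and then across nodes.

```python
from functools import reduce


def naive_average(a, b):
    """Pairwise average: looks harmless, but depends on grouping."""
    return (a + b) / 2


def reduce_partitions(partitions, fn):
    """Reduce inside each partition, then combine the per-partition results."""
    per_partition = [reduce(fn, part) for part in partitions]
    return reduce(fn, per_partition)


# The same six values split two different ways (hypothetical partitions).
split_a = [[1, 2, 3, 4], [5, 6]]
split_b = [[1, 2], [3, 4, 5, 6]]

naive_a = reduce_partitions(split_a, naive_average)  # 4.3125
naive_b = reduce_partitions(split_b, naive_average)  # 3.3125
# The true mean is 3.5, so both answers are wrong and they disagree.


def add_pairs(a, b):
    """Combine (sum, count) pairs; safe to apply in any grouping."""
    return (a[0] + b[0], a[1] + b[1])


def mean_via_pairs(partitions):
    as_pairs = [[(value, 1) for value in part] for part in partitions]
    total, count = reduce_partitions(as_pairs, add_pairs)
    return total / count


print(naive_a, naive_b, mean_via_pairs(split_a), mean_via_pairs(split_b))
```

Carrying a (sum, count) pair through the reduce and dividing at the end gives 3.5 for both partitionings.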

19
Q

What is the benefit of reduce by key?

A

It helps with partition stability.
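
One reading of "partition stability" is that the per-key result does not depend on how the data happens to be partitioned. The sketch below illustrates that reading in plain Python; `reduce_by_key` is a hypothetical helper written for this card (not Spark's implementation), and it assumes the combining function is associative and commutative, as addition is.

```python
def reduce_by_key(partitions, fn):
    """Combine values per key inside each partition, then merge across partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = fn(local[key], value) if key in local else value
        for key, value in local.items():
            merged[key] = fn(merged[key], value) if key in merged else value
    return merged


def add(x, y):
    return x + y


# Hypothetical key-value pairs, split two different ways.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
split_a = [pairs[:2], pairs[2:]]
split_b = [pairs[:4], pairs[4:]]

totals_a = reduce_by_key(split_a, add)
totals_b = reduce_by_key(split_b, add)
print(totals_a, totals_b)  # {'a': 9, 'b': 6} {'a': 9, 'b': 6}
```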