Final Exam Hadoop Flashcards
What are the three areas of expertise that you need to be a unicorn data scientist?
Hacking skills (programming; getting, manipulating, and exploring data), math and stats (foundational statistics, internals of algorithms, practical knowledge), and domain knowledge (understanding the business problem and understanding the data).
What is the truth about learning?
When you stop learning, your salary stops growing.
What are the last two V's of big data?
Value (statistical, hypothetical)
Veracity (trustworthiness)
What is a winning interview question for you to ask the employer?
Do you have a centralized or decentralized data model?
What is the basis of machine learning?
Extracting knowledge from data, at the intersection of statistics, AI, and computer science.
What is supervised learning?
Automating decision-making processes by generalizing from known (labeled) examples.
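A minimal scikit-learn sketch of the idea (the dataset and model choice are my own illustration, not from the course): fit on labeled examples, then score on unseen ones.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)                      # known examples with answers
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(model.score(X_test, y_test))                     # accuracy on unseen data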
What is unsupervised learning?
Learning from data alone: you have zero answers (labels) going in, which makes it harder to understand and evaluate than the supervised case.
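A matching sketch with k-means (again my own illustration): same data, labels deliberately discarded, and the algorithm just looks for structure.

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    X, _ = load_iris(return_X_y=True)       # throw the answers away on purpose
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(labels[:10])                      # cluster ids; no ground truth to score against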
What is overfitting?
Fitting a model so closely to the particularities of the training set that it works well there but not on new data.
What is underfitting?
Making your model too simple to capture the structure in the data.
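A quick sketch of both failure modes (the data and degrees are invented for illustration): a degree-1 polynomial underfits a sine curve, degree 15 overfits it, and the train/test scores show the gap.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(60, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for degree in (1, 4, 15):  # too simple, reasonable, too flexible
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
        print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))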
What are Hive and Impala?
SQL-like query engines on Hadoop. Impala is very fast, while Hive is a little more robust.
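To show the SQL-on-Hadoop flavor, here is a minimal sketch using Spark SQL as a stand-in engine (the table and columns are invented; Hive and Impala would accept essentially the same query).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()
    df = spark.createDataFrame([("NY", 10), ("CA", 3), ("NY", 7)], ["state", "orders"])
    df.createOrReplaceTempView("sales")

    spark.sql("SELECT state, SUM(orders) AS total FROM sales GROUP BY state").show()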
What is the equivalent of a SELECT statement in Hadoop?
A map.
What is the equivalent of a GROUP BY statement in Hadoop?
A reduce by key.
What is the equivalent of a WHERE clause in Hadoop?
A filter. The sketch below shows all three mappings side by side.
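A minimal PySpark sketch of the SELECT/WHERE/GROUP BY analogy (the data is invented):

    from pyspark import SparkContext

    sc = SparkContext("local", "sql-analogy")
    rows = sc.parallelize([("NY", 10), ("CA", 3), ("NY", 7), ("CA", 5)])

    kept = rows.filter(lambda r: r[1] > 4)            # WHERE  -> filter
    states = kept.map(lambda r: r[0])                 # SELECT -> map (projection)
    totals = kept.reduceByKey(lambda a, b: a + b)     # GROUP BY + SUM -> reduce by key

    print(totals.collect())                           # [('NY', 17), ('CA', 5)]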
What are the benefits of Spark?
Optimization, lazy loading/evaluation, speed, and efficiency. Spark is blazing fast because it keeps data in memory until an action is needed, and it is fault tolerant: you can go back and rerun a portion of the work.
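Lazy evaluation in a minimal PySpark sketch (data invented): transformations return instantly, and nothing actually runs until an action asks for a result.

    from pyspark import SparkContext

    sc = SparkContext("local", "lazy-demo")
    nums = sc.parallelize(range(1, 1000001))

    squared = nums.map(lambda x: x * x)           # lazy: nothing computed yet
    evens = squared.filter(lambda x: x % 2 == 0)  # still lazy

    print(evens.take(3))                          # the action triggers the pipeline: [4, 16, 36]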
What is an RDD?
A Resilient Distributed Dataset: an immutable, partitioned collection defined by the steps (transformations) used to build it, like steps on a map.
What is a DAG?
A directed acyclic graph: the record of how all your RDDs work together, making a complete map of the transformations you are going to apply to the data.
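You can see the DAG directly in PySpark through an RDD's lineage (a minimal word-count sketch; data invented):

    from pyspark import SparkContext

    sc = SparkContext("local", "dag-demo")
    counts = (sc.parallelize(["a b", "b c", "a a"])
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))

    # The lineage below is the DAG: the chain of transformations Spark
    # replays to recompute a lost partition (the fault tolerance above).
    print(counts.toDebugString().decode())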
What are key-value pairs?
Collections of (key, value) tuples where keys can repeat and change, unlike a unique identifier (a primary key in SQL). They matter because aggregation happens per key, which is why they get their own command set.
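A minimal pair-RDD sketch (data invented) showing repeated keys and the key-specific commands:

    from pyspark import SparkContext

    sc = SparkContext("local", "kv-demo")
    sales = sc.parallelize([("apples", 2), ("pears", 1), ("apples", 5)])  # keys repeat

    print(sales.reduceByKey(lambda a, b: a + b).collect())  # [('apples', 7), ('pears', 1)]
    print(sales.groupByKey().mapValues(list).collect())     # [('apples', [2, 5]), ('pears', [1])]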
Why does a simple reduce not work with aggregates across partitions?
Reduce combines two values at a time, then folds in the next; it reduces within each node (partition) first, not overall, and only then combines the partial results between nodes. For operations where that order matters, such as an average, this leads to inaccurate results.
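A minimal PySpark demonstration (data and partitioning invented): averaging pairwise inside a reduce gives the wrong mean, while carrying (sum, count) keeps the combine step order-safe.

    from pyspark import SparkContext

    sc = SparkContext("local", "reduce-pitfall")
    nums = sc.parallelize([1.0, 2.0, 3.0], numSlices=2)   # partitions: [1.0] and [2.0, 3.0]

    bad = nums.reduce(lambda a, b: (a + b) / 2)           # 1.75, not the true mean
    total, count = nums.map(lambda x: (x, 1)) \
                       .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    print(bad, total / count)                             # 1.75 2.0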
What is the benefit of reduceByKey?
It aggregates each key inside every partition before shuffling (a map-side combine), so per-key results stay correct across partitions and less data moves over the network.
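A minimal word-count sketch (data invented): each partition pre-sums its own keys, then only the small per-key partials are shuffled and merged.

    from pyspark import SparkContext

    sc = SparkContext("local", "reducebykey-demo")
    pairs = (sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=2)
               .map(lambda w: (w, 1)))

    print(sorted(pairs.reduceByKey(lambda a, b: a + b).collect()))
    # [('a', 3), ('b', 2), ('c', 1)]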