Final Exam Hadoop Flashcards
What are the three areas of expertise that you need to be a unicorn data scientist?
Hacking skills (programming; getting, manipulating, and exploring data), math and stats (foundational statistics, internals of algorithms, practical knowledge), and domain knowledge (understanding the business problem and understanding the data).
What is the truth about learning?
When you stop learning, your salary stops growing.
What are the last two V's of big data?
Value (statistical, hypothetical)
Veracity (trustworthiness)
What is a winning interview question for you to ask the employer?
Do you have a centralized or decentralized data model?
What is the basis of machine learning?
Extracting knowledge from data, at the intersection of statistics, AI, and computer science.
What is supervised learning?
Automating decision-making processes by generalizing from known (labeled) examples.
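A minimal scikit-learn sketch of the idea (the dataset and model choice are my own illustration, not from the course): fit on labeled examples, then score on unseen ones.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)                      # known examples with answers
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(model.score(X_test, y_test))                     # accuracy on unseen data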
What is unsupervised learning?
Learning from data alone: you have zero answers (labels) going in, which makes it harder to understand and evaluate than the supervised case.
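A matching sketch with k-means (again my own illustration): same data, labels deliberately discarded, and the algorithm just looks for structure.

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    X, _ = load_iris(return_X_y=True)       # throw the answers away on purpose
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(labels[:10])                      # cluster ids; no ground truth to score against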
What is overfitting?
Fitting a model so closely to the particularities of the training set that it works well there but not on new data.
What is underfitting?
Making your model too simple to capture the structure in the data.
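A quick sketch of both failure modes (the data and degrees are invented for illustration): a degree-1 polynomial underfits a sine curve, degree 15 overfits it, and the train/test scores show the gap.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(60, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for degree in (1, 4, 15):  # too simple, reasonable, too flexible
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
        print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))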
What are Hive and Impala?
SQL-like query engines on Hadoop. Impala is very fast, while Hive is a little more robust.
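To show the SQL-on-Hadoop flavor, here is a minimal sketch using Spark SQL as a stand-in engine (the table and columns are invented; Hive and Impala would accept essentially the same query).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()
    df = spark.createDataFrame([("NY", 10), ("CA", 3), ("NY", 7)], ["state", "orders"])
    df.createOrReplaceTempView("sales")

    spark.sql("SELECT state, SUM(orders) AS total FROM sales GROUP BY state").show()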
What is the equivalent of a SELECT statement in Hadoop?
A map.
What is the equivalent of a GROUP BY statement in Hadoop?
A reduce by key.
What is the equivalent of a WHERE clause in Hadoop?
A filter. The sketch below shows all three mappings side by side.
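A minimal PySpark sketch of the SELECT/WHERE/GROUP BY analogy (the data is invented):

    from pyspark import SparkContext

    sc = SparkContext("local", "sql-analogy")
    rows = sc.parallelize([("NY", 10), ("CA", 3), ("NY", 7), ("CA", 5)])

    kept = rows.filter(lambda r: r[1] > 4)            # WHERE  -> filter
    states = kept.map(lambda r: r[0])                 # SELECT -> map (projection)
    totals = kept.reduceByKey(lambda a, b: a + b)     # GROUP BY + SUM -> reduce by key

    print(totals.collect())                           # [('NY', 17), ('CA', 5)]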
What are the benefits of Spark?
Optimization, lazy loading/evaluation, speed, and efficiency. Spark is blazing fast because it keeps data in memory until an action is needed, and it is fault tolerant: you can go back and rerun a portion of the work.
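Lazy evaluation in a minimal PySpark sketch (data invented): transformations return instantly, and nothing actually runs until an action asks for a result.

    from pyspark import SparkContext

    sc = SparkContext("local", "lazy-demo")
    nums = sc.parallelize(range(1, 1000001))

    squared = nums.map(lambda x: x * x)           # lazy: nothing computed yet
    evens = squared.filter(lambda x: x % 2 == 0)  # still lazy

    print(evens.take(3))                          # the action triggers the pipeline: [4, 16, 36]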
What is an RDD?
A Resilient Distributed Dataset: an immutable, partitioned collection defined by the steps (transformations) used to build it, like steps on a map.
What is a DAG?
A directed acyclic graph: the record of how all your RDDs work together, making a complete map of the transformations you are going to apply to the data.
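You can see the DAG directly in PySpark through an RDD's lineage (a minimal word-count sketch; data invented):

    from pyspark import SparkContext

    sc = SparkContext("local", "dag-demo")
    counts = (sc.parallelize(["a b", "b c", "a a"])
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))

    # The lineage below is the DAG: the chain of transformations Spark
    # replays to recompute a lost partition (the fault tolerance above).
    print(counts.toDebugString().decode())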
What are key-value pairs?
Collections of (key, value) tuples where keys can repeat and change, unlike a unique identifier (a primary key in SQL). They matter because aggregation happens per key, which is why they get their own command set.
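A minimal pair-RDD sketch (data invented) showing repeated keys and the key-specific commands:

    from pyspark import SparkContext

    sc = SparkContext("local", "kv-demo")
    sales = sc.parallelize([("apples", 2), ("pears", 1), ("apples", 5)])  # keys repeat

    print(sales.reduceByKey(lambda a, b: a + b).collect())  # [('apples', 7), ('pears', 1)]
    print(sales.groupByKey().mapValues(list).collect())     # [('apples', [2, 5]), ('pears', [1])]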
Why does a simple reduce not work with aggregates across partitions?
Reduce combines two values at a time, then folds in the next; it reduces within each node (partition) first, not overall, and only then combines the partial results between nodes. For operations where that order matters, such as an average, this leads to inaccurate results.
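A minimal PySpark demonstration (data and partitioning invented): averaging pairwise inside a reduce gives the wrong mean, while carrying (sum, count) keeps the combine step order-safe.

    from pyspark import SparkContext

    sc = SparkContext("local", "reduce-pitfall")
    nums = sc.parallelize([1.0, 2.0, 3.0], numSlices=2)   # partitions: [1.0] and [2.0, 3.0]

    bad = nums.reduce(lambda a, b: (a + b) / 2)           # 1.75, not the true mean
    total, count = nums.map(lambda x: (x, 1)) \
                       .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    print(bad, total / count)                             # 1.75 2.0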
What is the benefit of reduceByKey?
It aggregates each key inside every partition before shuffling (a map-side combine), so per-key results stay correct across partitions and less data moves over the network.
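A minimal word-count sketch (data invented): each partition pre-sums its own keys, then only the small per-key partials are shuffled and merged.

    from pyspark import SparkContext

    sc = SparkContext("local", "reducebykey-demo")
    pairs = (sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=2)
               .map(lambda w: (w, 1)))

    print(sorted(pairs.reduceByKey(lambda a, b: a + b).collect()))
    # [('a', 3), ('b', 2), ('c', 1)]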