Final Exam Hadoop Flashcards
What are the three areas of expertise you need to become a unicorn data scientist?
Hacking skills (programming; getting, manipulating, and exploring data), math and stats (foundational statistics, internals of algorithms, practical knowledge), and domain knowledge (understanding the business problem and the data).
What is the truth about learning?
When you stop learning, your salary stops growing.
What are the last two V's of big data?
Value (statistical, hypothetical)
Veracity (trustworthiness)
What is a winning interview question for you to ask the employer?
Do you have a centralized or decentralized data model?
What is the basis of machine learning?
Extracting knowledge from data.
It sits at the intersection of statistics, AI, and computer science.
What is supervised learning?
Automating decision-making processes by generalizing from known examples.
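To make "generalizing from known examples" concrete, here is a minimal sketch using a 1-nearest-neighbour rule in plain Python. The feature names, the data, and the `predict` helper are all made up for illustration; this is one simple supervised method, not the only one.

```python
# Labeled training examples: (hours_studied, hours_slept) -> outcome.
# The data is hypothetical and exists only to illustrate the idea.
training = [
    ((1.0, 4.0), "fail"),
    ((2.0, 5.0), "fail"),
    ((6.0, 7.0), "pass"),
    ((8.0, 8.0), "pass"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(features):
    """Return the label of the closest known example (1-nearest-neighbour)."""
    _, label = min(training, key=lambda ex: squared_distance(ex[0], features))
    return label

# Neither of these points appears in the training data.
print(predict((7.0, 6.0)))  # closest known example is labeled "pass"
print(predict((1.5, 6.0)))  # closest known example is labeled "fail"
```

The known answers (the labels) are what make this supervised: the rule only decides which known example a new point most resembles.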
What is unsupervised learning?
Learning from data alone, with zero answers going in. It is harder to understand and evaluate than supervised learning.
What is overfitting?
Fitting a model too closely to the particularities of the training set, so that it works well on the training data but not on new data.
What is underfitting?
Making your model too simple, so that it misses patterns even in the training data.
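The two cards above can be compared side by side with a small sketch. It assumes made-up data that follows y = 2x with alternating noise, and three hypothetical models: one that ignores x (too simple), one that memorizes every training point (too closely fitted), and a least-squares straight line.

```python
# Training data: y = 2x plus noise alternating +3 / -3 (hypothetical data).
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (3 if i % 2 == 0 else -3) for i, x in enumerate(train_x)]
# New data from the same underlying rule, without the noise.
test_x = [i + 0.5 for i in range(9)]
test_y = [2 * x for x in test_x]

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)

def underfit(x):
    """Too simple: ignore x and always predict the training mean."""
    return mean_y

def overfit(x):
    """Too close: memorize the training set and answer with the nearest point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

# Least-squares straight line fitted to the training data.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
sxx = sum((x - mean_x) ** 2 for x in train_x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def line(x):
    return slope * x + intercept

for name, model in [("underfit", underfit), ("overfit", overfit), ("line", line)]:
    train_err = mse(model, train_x, train_y)
    test_err = mse(model, test_x, test_y)
    print(f"{name:9s} train={train_err:.2f} test={test_err:.2f}")
```

The memorizing model has zero training error because it reproduces the noise exactly, yet it does worse than the straight line on new points. The constant model does poorly on both sets.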
What are Hive and Impala?
SQL-like query tools for Hadoop. Impala is very fast, while Hive is a little more robust.
What is the equivalent of a SELECT statement in Hadoop?
map
What is the equivalent of a GROUP BY statement in Hadoop?
reduce by
What is the equivalent of a WHERE clause in Hadoop?
filter
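The three SQL analogies above can be sketched in plain Python. This is not Hadoop or Spark code; it only mirrors the shape of the operations using the built-in `filter` and `map` plus a per-key sum. The rows and column meanings are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of (department, salary); the data is made up.
rows = [("eng", 100), ("eng", 120), ("ops", 80), ("ops", 90), ("hr", 70)]

# WHERE salary >= 80  ->  filter: keep only the rows that match
kept = filter(lambda row: row[1] >= 80, rows)

# SELECT department, salary  ->  map: turn each row into a (key, value) pair
pairs = map(lambda row: (row[0], row[1]), kept)

# GROUP BY department with SUM(salary)  ->  reduce: combine values per key
totals = defaultdict(int)
for dept, salary in pairs:
    totals[dept] += salary

print(dict(totals))  # {'eng': 220, 'ops': 170}
```

The "hr" row is dropped by the filter step, so it never reaches the grouping step.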
What are the benefits of Spark?
Optimization, lazy loading/evaluation, speed, and efficiency.
Blazing fast because it keeps data in memory until an action is needed.
Fault tolerance: you can go back and rerun just a portion of the work.
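Lazy evaluation can be illustrated with Python generators. This is only an analogy and does not use Spark's API: the pipeline is described first, and no work happens until a result is requested. The `log` list and the `numbers` function are hypothetical helpers added to show when work actually occurs.

```python
log = []  # records each time a value is actually read

def numbers():
    """Hypothetical data source that notes every read in the log."""
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n

# "Transformations": building the pipeline does no work yet.
squared = (n * n for n in numbers())
evens = (n for n in squared if n % 2 == 0)
work_before_action = len(log)

# "Action": asking for a result pulls the data through the whole pipeline.
result = sum(evens)

print(work_before_action)  # 0, nothing was read while building the pipeline
print(result)              # 20, the sum of the even squares 4 and 16
print(len(log))            # 5, every value was read once the action ran
```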
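What is an RDD?
A Resilient Distributed Dataset. Think of it as steps on a map: it records the sequence of transformations used to build the data.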