Literature & Lectures Flashcards

Question

Name 4 reasons why Hadoop is important

Answer 1

1. Better career scope 2. A maturing technology 3. Data management 4. Omnipresent 5. Open to all (Hadoop is easy to manage) 6. Shortage of professionals

Answer 2

A software system that enables users to define, create, maintain and control access to databases.
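As a minimal sketch of what "define, create, maintain" looks like in practice, the snippet below uses SQLite through Python's standard `sqlite3` module. The `student` table and its rows are made up for illustration. SQLite has no user accounts, so the "control access" part of the definition is not shown here.

```python
import sqlite3


def demo():
    # An in-memory database keeps the example self-contained.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Define and create: declare the structure of a (hypothetical) table.
    cur.execute(
        "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )

    # Maintain: insert, update and delete records.
    cur.executemany(
        "INSERT INTO student (id, name) VALUES (?, ?)",
        [(1, "Ada"), (2, "Alan")],
    )
    cur.execute("UPDATE student SET name = ? WHERE id = ?", ("Ada Lovelace", 1))
    cur.execute("DELETE FROM student WHERE id = ?", (2,))
    conn.commit()

    # Query what is left.
    rows = cur.execute("SELECT id, name FROM student ORDER BY id").fetchall()
    conn.close()
    return rows


print(demo())
```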

Answer 3

1. Navigational models (e.g. hierarchical, network, and graph database models) 2. Relational models

Answer 4

A collection of records

Answer 5

Describes the expected number of related occurrences between the two entities in a relationship

Answer 6

Schematic representation of the database

Answer 7

Fitting messy 'real-life' data into a homogeneous, uniform database, and making sure that the database is accurate, scalable, and easy to update and query.

Answer 8

Redundancy, confusion, improper keys, wasted storage, and incorrect or outdated data

Answer 9

1. Data acquisition: produce, derive and collect data
2. Information extraction and cleaning: pull out information and express it in a structured form suitable for analysis
3. Data integration, aggregation and representation: collect heterogeneous data from multiple sources
4. Modeling and analysis: methods for querying and mining big data
5. Interpretation: a decision-maker has to interpret the results of the analysis

Answer 10

1. Heterogeneity 2. Inconsistency and incompleteness 3. Scale 4. Timeliness 5. Privacy and data ownership

Answer 11

1. Useful knowledge can be extracted from data by following a process with reasonably well-defined stages, such as the Cross-Industry Standard Process for Data Mining (CRISP-DM).
2. Evaluating data-science results requires careful consideration of the context in which they will be used.
3. The relationship between the business problem and the analytics solution can often be decomposed into tractable subproblems via the framework of expected-value analysis.
4. Information technology can be used to find informative data items within a large body of data.
5. Entities that are similar with respect to known features or attributes are often similar with respect to unknown features or attributes.
6. If you look too hard at a set of data, you will find something, but it might not generalize beyond the data you are observing (overfitting).
7. To draw causal conclusions, one must pay very close attention to the presence of confounding factors, possibly unseen ones.

Answer 12

1. Not understanding the issues of integration 2. Not realizing the limits of unstructured data 3. Assuming correlations mean something 4. Underestimating the labor skills needed

Answer 13

1. Start small: define a few relatively simple analytics. This allows the organization to see what the data can do, and the results are easier to test.
2. Targeted prototyping: capture only the data you need to perform the test, instead of dealing with all of the data available. This is a lower-risk way to see what big data can do for your firm and to test your firm's readiness to use it.

Answer 14

1. Make better decisions 2. Optimize internal operations 3. Optimize external operations 4. Free workers to be more creative 5. Enhance current products

Answer 15

1. New algorithms for machine learning 2. The internet and the cloud 3. Big data 4. Moore’s Law

Answer 16

NLP (Natural Language Processing) is an umbrella term for everything from speech recognition to language generation, each requiring different techniques (used in, for example, chatbots and translation)

Answer 17

1. Heuristics: a practical method for finding a solution that is not guaranteed to be optimal but is sufficient for the immediate goals (such as navigation)
2. Support Vector Machine: classification problems where there is no straightforward rule for identifying the classes (such as a spam filter or recognizing handwriting or characters)
3. Artificial Neural Networks: understanding complex relationships between features of a certain item (such as image or speech recognition)
4. Markov Decision Process: finding a policy that tells the decision-maker which action to take in which state; used for solving complex decision-making problems (such as inventory planning)
5. Natural Language Processing: an umbrella term for everything from speech recognition to language generation, each requiring different techniques (used in, for example, chatbots and translation)
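To make the first item concrete, here is a small sketch of one well-known heuristic for a navigation-style problem: the nearest-neighbour rule for ordering stops on a route. It always goes to the closest unvisited stop, which is quick and usually reasonable but not guaranteed to give the shortest route. The function name and the stop coordinates are made up for illustration.

```python
import math


def nearest_neighbour_route(points, start=0):
    """Greedy heuristic: from the current stop, always go to the closest unvisited stop."""
    unvisited = set(range(len(points)))
    unvisited.remove(start)
    route = [start]
    while unvisited:
        last = points[route[-1]]
        # Break distance ties by index so the result is deterministic.
        nxt = min(unvisited, key=lambda i: (math.dist(last, points[i]), i))
        unvisited.remove(nxt)
        route.append(nxt)
    return route


# Hypothetical stops on a straight road, given as (x, y) coordinates.
stops = [(0, 0), (5, 0), (1, 0), (2, 0)]
print(nearest_neighbour_route(stops))  # visiting order as indices into stops
```

The rule only looks one step ahead, which is why it is fast and also why it can miss the best overall route.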

Answer 18

The ability to understand

Answer 19

Linear Regression allows us to map numeric inputs to numeric outputs by fitting a line to the data points.
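A minimal sketch of fitting such a line with ordinary least squares, using only the standard library. The helper name `fit_line` and the hours/scores data are made up for illustration; real projects would normally use a library for this.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied (input) and exam score (output).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
print(f"score = {slope:.1f} * hours + {intercept:.1f}")
```

The fitted line does not pass through every point; it summarizes the overall trend in the data.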

Answer 20

To capture the dominant trend and fit our line within that trend. Note: we always want to find the trend, not force the line through every data point!

Answer 21

Machine Learning models fulfill their purpose when they generalize well. Generalization is bounded by two undesirable outcomes: high bias and high variance. Low bias together with low variance is the desired situation ("generalization"). This trade-off is the most integral aspect of Machine Learning model training. Detecting whether the model suffers from either one is the sole responsibility of the model developer.

Answer 22

!! High Bias !! Underfitting is the case where the model has "not learned enough" from the training data, resulting in poor generalization and unreliable predictions.

Answer 23

!! High Variance !! Overfitting is the case where the cost on the training data is very small, but the generalization of the model is unreliable. This is due to the model learning "too much" from the training data set.
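The difference between high bias and high variance can be sketched by comparing training error with error on unseen data. The example below is an illustration under made-up data (a noisy line y = 2x + 1), not a general recipe: a constant model stands in for underfitting, a model that memorizes the training points stands in for overfitting, and a least-squares line sits in between. All function names are hypothetical.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    # High bias: ignores x and always predicts the training mean.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    # Least-squares line: captures the trend without chasing the noise.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    # High variance: returns the output of the closest training point, noise included.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Made-up data: y = 2x + 1 with noise of +1 or -1 added by hand.
train_x, train_y = [0, 2, 4, 6, 8], [2, 4, 10, 12, 18]
test_x, test_y = [0.5, 2.5, 4.5, 6.5], [1, 7, 9, 15]

results = {}
for name, fit in [("constant", fit_constant), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:10s} train MSE = {results[name][0]:6.2f}   test MSE = {results[name][1]:6.2f}")
```

On this data the constant model has a large error on both sets, the memorizer has zero training error but a clearly larger test error, and the line has a small error on both.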

Final Exam 10-06-2020 (47 cards)