Big data Flashcards

1
Q

What are the FOUR dimensions of Big Data?

A
  • Volume: refers to the quantity of available data
  • Velocity: refers to the rate at which the data is recorded/collected
  • Veracity: refers to quality and applicability of data
  • Variety: refers to the different types of available data
2
Q

What characterizes big data/how is it defined?

A

Big Data is the Information asset characterized by such a High Volume, Velocity and Variety as to require specific Technology and Analytical Methods for its transformation into Value

Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making

3
Q

What are the drivers behind big data?

A
  • Increased data volumes being captured and stored
  • Rapid acceleration of data growth
  • Increased data volumes pushed into the network
  • Growing variation in types of data assets for analysis
  • Alternate and unsynchronized methods for facilitating data delivery
  • Rising demand for real-time integration of analytical results
4
Q

What is NoSQL?

A

“Not only SQL” –> alternate model for data management

  • Provides a variety of methods for managing information to best suit specific business process needs, such as in-memory data management, columnar layouts to speed query response, and graph databases
5
Q

What is MPP? and how is it related to Big Data?

A

Massively Parallel Processing

–> A computing architecture in which many processors work on a task simultaneously, utilizing high-bandwidth networks and massive I/O devices

RELATION TO BD:
- Big data is a smarter take on the same idea, since it couples clusters of (commodity) hardware components with open source tools and technology

6
Q

What five aspects will a corporation considering incorporating Big Data need to consider?

A

• Feasibility: is the enterprise aligned in a way that allows new and emerging technologies to be brought into the organization?
• Reasonability: will the resource requirements exceed the capability of the existing or planned environment?
• Value: do the results warrant the investment?
• Integrability: are there any constraints or impediments within the organization from a technical, social, or political perspective?
• Sustainability: are the costs associated with maintenance, configuration, skills maintenance, and adjustments to the level of agility sustainable?

7
Q

Name the 7 types of people needed for implementing Big Data?

A

1) Business evangelist –> understands current limitations of the existing tech infrastructure
2) Technical evangelist –> understands the emerging tech and the science behind it
3) Business analyst –> engages the business process owners and identifies measures to quantify
4) Big Data application architect –> experienced in high performance computing
5) Application developer –> identifies the technical resources with the right set of skills for programming
6) Program manager –> experienced in project management
7) Data scientist –> experienced in coding and statistics/AI

8
Q

What is the Big Data framework? and what key components does it consist of?

A

Overall picture of the Big Data landscape, consists of:

  • Infrastructure (e.g. SAP and SQL)
  • Analytics (e.g. Google analytics)
  • Applications (e.g. Human capital, legal, security)
  • Cross-infrastructure analytics (Google, Microsoft, Oracle)
  • Open source (e.g. RStudio)
  • Data sources and APIs (Application Programming Interfaces) (e.g. Garmin, Apple)
9
Q

What is API?

A

Application programming interface

is a set of routines, protocols, and tools for building software applications. Basically, an API specifies how software components should interact. Additionally, APIs are used when programming graphical user interface (GUI) components

  • Enables software components to talk to each other
10
Q

Which is better: row- or column-oriented data?

A

Column-oriented data; since it reduces the latency by storing each column separately

Access performance; ROW: not good for many simultaneous queries (as opposed to column)

Speed of aggregation; Much faster in column-oriented data

Suitability to compression; column-data better suited for compression, decreasing storage needs.

Data load speed; faster in column-oriented storage: since each column is stored separately, you can load data in parallel using multiple threads

11
Q

Hardware versus software?

A

Go to slide 36 and 37 and discuss

12
Q

Name the four tools and techniques?

A

Processing capability
- Often provided by several interconnected nodes/cores, allowing tasks to be run simultaneously (called MULTITHREADING)

Storage of data

Memory
- Holds the data in the node currently running

Network
- Communication infrastructure between the nodes

13
Q

What types of architectural clusters exist? And what are the two OVERALL types?

A

Slide 42 and 43:

OVERALL: centralized and decentralized

  • Fully connected network topology
  • Mesh network topology
  • Star network topology
  • Common bus topology
  • Ring network topology
14
Q

What does the general architecture distinguish between? and what are their roles?

A

Management of computing resources
- oversees the pool of processing nodes, assign tasks and monitors activity

Management of data/storage
- oversees the data storage pool and distributes datasets across the collection of storage resources

15
Q

What are the three important layers of Hadoop?

A
  • Hadoop Distributed File System (HDFS)
  • MapReduce
  • YARN: a new generation framework for job scheduling and cluster management
16
Q

What are the main functions of HDFS?

A
  • Attempts to enable the storage of LARGE files, by distributing the data among a pool of data nodes
  • Monitoring of communication between nodes and masters
  • Rebalancing of data from one block to another if free capacity is available
  • Managing integrity using checksums/digital signatures
  • Metadata replication to protect against corruption
  • Snapshots/copying of data to establish check-points
17
Q

What are the four advantages of using HDFS?

A

1) decreasing the cost of specialty large-scale storage systems
2) providing the ability to rely on commodity components
3) enabling the ability to deploy using cloud-based services
4) reducing system management costs

18
Q

What is MapReduce?

A
  • It is a software framework
  • Used to write applications which process vast amounts of data in parallel on large clusters
  • It is fault-tolerant
  • Combines both data and computational independence
    (both data and computations can be distributed across nodes, which enables strong parallelization)
19
Q

What are the two steps in MapReduce?

A

Map: Describes the computation analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs

Reduce: the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results

Example; count the number of occurrences of a word in a corpus:

key: is the word
value: is the number of times the word is counted
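A minimal sketch of the word-count idea in R (not an actual Hadoop job; the two-document corpus is made up): the map step emits a (word, 1) pair per word, and the reduce step sums the values for each key.

```r
# Toy corpus (hypothetical)
docs <- c("big data is big", "data needs tools")

# Map: emit an intermediate (key = word, value = 1) pair for every word
words  <- unlist(strsplit(docs, " "))
mapped <- data.frame(key = words, value = 1)

# Reduce: combine all values sharing the same key by summing them
word_counts <- aggregate(value ~ key, data = mapped, FUN = sum)
print(word_counts)   # e.g. "big" appears twice
```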

20
Q

What is parallelization?

A

the act of designing a computer program or system to process data in parallel. Normally, computer programs compute data serially: they solve one problem, and then the next, then the next. If a computer program or system is parallelized, it breaks a problem down into smaller pieces that can each independently be solved at the same time by discrete computing resources

21
Q

What are the four use cases for big data?

A

Counting; document indexing, filtering, aggregation

Scanning; sorting, text analysis, pattern recognition

Modeling; analysis and prediction

Storing; rapid access to stored large datasets

22
Q

What is data mining?

A

The art and science of discovering knowledge, insights and patterns in data

  • e.g. predicting the winning chances of a sports team
  • or identifying friends and foes in warfare
  • or forecasting rainfall patterns in a region

It helps recognize the hidden value in data

23
Q

Describe the typical process of data mining?

A
  1. Understand the application domain
  2. Identify data sources and select target data
  3. Pre-process: cleaning, attribute selection
  4. Data mining to extract patterns or models
  5. Post-process: identifying interesting or useful patterns
  6. Incorporate patterns in real world tasks

OR:

Data input –> data consolidation –> data cleaning –> data transformation –> data reduction –> well-formed data

24
Q

In terms of data mining, what does ETL stand for?

A

Extract, transform, load

25
What are some of the common mistakes of data-mining?
1. Selecting the wrong problem for data mining
2. Not leaving sufficient time for data acquisition, selection and preparation
3. Looking only at aggregated results and not at individual records/predictions
4. Being sloppy about keeping track of the data mining procedure and results
5. Ignoring suspicious (good or bad) findings and quickly moving on
6. Running mining algorithms repeatedly and blindly, without thinking about the next stage
7. Naively believing everything you are told about the data
8. Naively believing everything you are told about your own data mining analysis
26
Describe the Harvard Business Review matrix
Two parameters: 1) not useful --> very useful; 2) time-consuming --> not time-consuming
Upper-left: plan; upper-right: learn; lower-left: ignore; lower-right: browse
BEWARE --> VERY bad matrix, cannot be used
27
Why do we have an error term in a statistical linear model?
Because the model is not deterministic (perfect)
Error terms are normally distributed!
28
What method is usually used to estimate a linear relationship?
Ordinary Least Square (OLS)
29
How do we define the slope, i.e. B1?
Cov(x,y) / Var(x)
30
How do we define the intercept?
B0 = ȳ − B1·x̄ (the sample means of y and x)
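A small R sketch with simulated data, showing that the formulas above reproduce what lm() estimates:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)     # true intercept 2, slope 3

b1 <- cov(x, y) / var(x)        # slope: Cov(x, y) / Var(x)
b0 <- mean(y) - b1 * mean(x)    # intercept: ybar - b1 * xbar

c(b0 = b0, b1 = b1)
coef(lm(y ~ x))                  # should match the manual estimates
```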
31
What is the estimated variance of error?
SSE / (n − 2)
SSE: sum of squared errors, AKA the sum of squared residuals (SSR)
32
How do we find R^2 and what is its nickname?
Coefficient of determination
R² = ESS / TSS, or equivalently 1 − RSS/TSS
i.e. the explained sum of squares over the total sum of squares
33
What are the four principal assumptions that justify the use of a linear model?
1) Linearity and additivity
- The slope of the line does not depend on the values of the other variables
- The effect of each independent variable is additive when estimating the dependent variable
2) Statistical independence of the errors
3) Homoscedasticity: constant variance of the errors
4) Normality of the error distribution
34
What happens when linearity and additivity are violated in the OLS?
- NOT good
- Use another model that is not linear: either a logarithmic model (if all datapoints are positive) or a polynomial model
35
What happens when independence of the error terms is violated in the OLS?
Diagnosis: look at a residual time series plot
- Often a concern in panel/longitudinal data
- For minor cases of positive serial correlation: add the lagged independent variable as a predictor
- For minor cases of negative serial correlation: solve by differencing
- For large serial correlation --> RESTRUCTURE THE MODEL, maybe standardize all variables
36
What happens when homoscedasticity is violated using the OLS?
Diagnosis; plot the residuals versus the predicted values - You get imprecise predictions and confidence intervals - If errors are increasing over time, CIs for out-of-sample predictions are unrealistically narrow SOLUTION: robust standard errors or transformation of model
37
What happens when normality is violated using the OLS?
Diagnosis; Plot of normality or a Shapiro-Wilk test - Makes it very hard to determine if a coefficient is significantly different from zero and to provide CIs BUT; if your goal is only to estimate the values of the coefficients, then THIS IS NOT A PROBLEM, only if you have to do predictions
38
What model can we use, in the case the outcome is not normal (Gaussian)?
Generalized Linear Model
39
What is a Generalized Linear Model (GLM)?
Characterized by THREE components:
1) Random: associated with the dependent variable and its probability distribution
2) Systematic: identifies the selected covariates through a linear predictor
3) Link function: identifies the function of E[Y] such that it is equal to the systematic component
40
Normally, will a binary response be derived from a linear relation?
No; it will typically be non-linear --> could use logistic regression
41
What shape does the logistic regression have?
S-curved, as the predicted probability almost never reaches exactly 1 or exactly 0
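A minimal sketch in R with simulated data: glm() with a binomial family and logit link fits the S-shaped relation, and the predicted probabilities stay strictly between 0 and 1.

```r
set.seed(2)
x <- rnorm(200)
p <- 1 / (1 + exp(-(-1 + 2 * x)))        # S-shaped true probability
y <- rbinom(200, size = 1, prob = p)     # binary response

# GLM: binomial random component + linear predictor + logit link
fit <- glm(y ~ x, family = binomial(link = "logit"))
summary(fit)

# Predicted probabilities stay strictly between 0 and 1
range(predict(fit, type = "response"))
```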
42
What is panel data and what is the associated model?
Panel (aka longitudinal) data: repeated measures of the same subjects --> Panel Regression
- AVOID GLM and LM, since the independence assumption no longer holds
43
What are the pros of a Panel regression?
• Accounts for sample heterogeneity and computes individual-specific estimates
• Suitable to study dynamics
• Minimises bias due to aggregation (like averaging over time)
• Enables control for unobserved variables like cultural factors or differences in business practices across companies (i.e. national policies, federal regulations, international agreements, etc.)
• Enables more complex hierarchical models (Generalized Additive Models, GAM)
44
What are the cons of a Panel regression?
• Data collection (i.e. sampling design, coverage)
• Unwanted correlation (i.e. same country but different measures)
• Analyses are much more complex
45
What are the two possible approaches to a panel regression, and how do they work?
Fixed effects (FE) model --> explores the relationship between predictor and outcome variables within a subject
Random effects (RE) model --> assumes that the variation across subjects is random and uncorrelated with the predictors
46
What are the two assumptions of the FE model?
1) Correlation between the subject's error term and the predictors --> FE removes the effect of time-invariant characteristics, so we can estimate the NET effect of the predictors on the outcome
2) All time-invariant characteristics are individual specific, so they are NOT correlated with those of other individuals
CAUTIONS: lots of dummy variables and increased risk of multicollinearity
47
When do we use the RE model and what is the assumptions behind?
Used when differences across subjects have some influence on the dependent variable
Assumption: the subject's error term is uncorrelated with the predictors --> therefore, you can use time-invariant variables as EXPLANATORY variables
CAUTIONS: RE needs REALLY strong assumptions, and FE is often more convincing. The choice between RE and FE can be tested with the Hausman test
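A minimal sketch assuming the plm package is installed, using its bundled Grunfeld firm-year panel; it fits FE and RE models and compares them with the Hausman test (phtest):

```r
library(plm)                        # assumes the plm package is installed
data("Grunfeld", package = "plm")  # classic firm-year panel dataset

fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")  # fixed effects
re <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "random")   # random effects

# Hausman test: a significant p-value favours the FE specification
phtest(fe, re)
```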
48
What are the between and within-entity errors in the RE model?
Slide 102: Between = u_it, Within = e_it
49
What are the two overall types of classifications for clustering?
1) Clustering (unsupervised learning)
- No prior classification; the algorithm explores all possible combinations
2) Supervised learning
- We do have prior knowledge about the data; we know the labels, categories etc.
- Training a computer to learn a system we created
- Based on data, therefore the more data, the better the system
50
What are the three partitions of unsupervised learning in clustering, and what are the sub-categories?
Hard clustering
1) Hierarchical methods (agglomerative and divisive)
2) Partitioning methods (non-hierarchical)
Soft clustering
3) Fuzzy methods
More complex methods
4) Density-based methods
5) Model-based methods
51
In regards to clustering, explain the concept of (statistical) distance
- Needed for every classification method
- Distance defines the DISsimilarity between two points
- Numerous distance measures exist --> think of the dissimilarity matrix (slide 118)
52
Name two examples of measures of statistical distance
1) Pearson correlation distance | 2) Eisen cosine correlation distance
53
How does hierarchical clustering methods work? and which two types exist? How is visualized?
1) Agglomerative: each observation starts as its own cluster. Iteratively, the most similar clusters (leaves) are merged until only a single cluster remains (the root)
2) Divisive: the INVERSE of the agglomerative approach. It begins with the root and the most heterogeneous clusters are subsequently divided until each observation forms its own cluster
Thus, no need to define the number of cluster groups beforehand
Visualization: dendrogram
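A minimal base-R sketch of agglomerative clustering on the built-in USArrests data; the choice of Euclidean distance and complete linkage is just an example:

```r
d  <- dist(scale(USArrests))           # dissimilarity matrix (Euclidean)
hc <- hclust(d, method = "complete")   # agglomerative clustering
plot(hc)                               # dendrogram
cutree(hc, k = 4)                      # cut the tree into 4 groups afterwards
```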
54
What are the pros and cons of the hierarchical method of clustering?
Pros:
• No a-priori information about the number of clusters required
• Easy to implement
• Very replicable
Cons:
• Not very efficient: O(n² log n)
• Based on a dissimilarity matrix which has to be chosen in advance
• No objective function is directly minimised
• The dendrogram is not the best tool to choose the optimum number of clusters
• Hard to treat non-convex shapes
55
How does the partitioning method of clustering work?
- Simplest method
- Requires a predefined number of clusters
- Iterative, geometry-based methods
The 3 most famous types:
1) K-means: each cluster is represented by the center of the cluster
2) K-medoids or PAM: each cluster is represented by one of the points in the cluster
3) CLARA (Clustering LARge Applications): suitable for large datasets
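A minimal base-R sketch of k-means on the built-in USArrests data; k = 3 is an arbitrary choice for illustration, since the number of clusters must be specified in advance:

```r
set.seed(3)
km <- kmeans(scale(USArrests), centers = 3, nstart = 25)
km$centers   # each cluster is represented by its centre
km$cluster   # cluster assignment per observation
```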
56
What are the pros and cons of the partitioning method of clustering?
Pros:
• k-means is relatively efficient: O(tkn), with k, t << n
• Easy to implement and understand
• Totally replicable
Cons:
• PAM does not scale well for large datasets
• Applicable only when the mean is defined (i.e. no categorical data)
• Need to specify k in advance
• k-means is unable to handle noisy data and outliers; PAM does better
• Not suitable to discover clusters with non-convex shapes
57
How do we check if data can be clustered (has a tendency) or whether we are just synthesizing clusters?
- Use the Hopkins statistic; it measures the probability that your dataset is uniformly distributed
- Aka. it tests the spatial randomness of the data
Mechanics:
- Based on each observation's distance to its nearest neighbor
- This distance is then compared to a random sample's distance to its nearest neighbor
Formula: average dist. to random neighbor / (avg. dist. to random neighbor + avg. dist. to real neighbor)
58
Three ways to find the optimal number of clusters?
Plenty of methods exist, however we focus only on 3:
1) Silhouette
2) Gap statistic
3) Within sum of squares
59
Name the four overall methods of supervised clustering
1) Heuristic approach
- k-NN (nearest neighbor)
2) Model-based approach
- Linear discriminant analysis
- Quadratic discriminant analysis
- Logistic
- Naïve Bayes
3) Binary decision
- Classification and regression trees (CART)
- Random forest
4) Optimisation based
- Support Vector Machine (SVM)
- Neural Networks
60
What is Naïve Bayes, and how does it work?
- A probabilistic machine learning algorithm
- Based on two components: conditional probability and Bayes' rule
61
What is Bayes rule?
P(Y | X) = ( P(X | Y) * P(Y) ) / P(X)
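A worked example of Bayes' rule in R, with made-up spam-filtering numbers:

```r
p_spam      <- 0.2    # P(Y): prior probability a message is spam (made up)
p_word_spam <- 0.5    # P(X | Y): "free" appears given spam
p_word_ham  <- 0.05   # P(X | not Y): "free" appears given not spam

p_word      <- p_word_spam * p_spam + p_word_ham * (1 - p_spam)  # P(X)
p_spam_word <- p_word_spam * p_spam / p_word                     # P(Y | X)
p_spam_word   # ~0.71: seeing "free" raises the spam probability from 0.2
```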
62
What clustering method/classifier is the best?
4 criteria:
1) Accuracy: the empirical accuracy computed on the same dataset as the one used to learn the classifier
2) AccuracyCV: the mean of the accuracies obtained by the resampling scheme
3) AccuracyInf: a lower bound on the mean accuracy, obtained as described on slide 167
4) AccuracyPAC: a highly probable bound on the accuracy, obtained simply by subtracting the standard deviation
63
What are the two fundamental concepts of model evaluation?
Generalization --> the property of a model that it can be applied to unseen data (used for predictions) --> the model generalizes beyond the training data
Overfitting --> when a model is tailored to the training data at the expense of generalization to test data
- Always present to some extent, thus you have a trade-off between model complexity and overfitting
64
How does one go about assessing overfitting?
- Assessment based on holdout data: comparing the predicted values of the model with hidden true values (holdout data)
- See graph slide 184 --> you do not want perfect overlap, but rather some distance between your training set and your holdout
- Error is a decreasing function of complexity
CAUTION: it is ONLY a way to get a feeling for the generalization, since it provides only a single estimate
65
What are the drivers behind overfitting?
TWO: 1) too many features/independent variables 2) too complex functions (^2, ^3, ^4) (cut these out)
66
What model is more prone to suffer from overfitting?
Logistic model, especially compared to the Support Vector Machine (SVM) model. Flower example; logistic model very sensitive to outliers, which makes the model bad/overfitted
67
What method is superior to holdout data, when evaluating model performance?
Cross-validation (CV)
68
What is cross-validation and how does it work?
- CV performs multiple splits and systematically swaps out samples for testing - Number of partitions is k --> called "folds" (around 5 folds normally) - CV then iterates training and test sets k times; e.g. first test uses fold 1-4 for training, and fold 5 for hold out, then second test uses fold 2-5 for training, and fold 1 for hold out.
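A hand-rolled 5-fold cross-validation sketch in base R with simulated data, showing the fold-swapping mechanics described above:

```r
set.seed(4)
x <- rnorm(100); y <- 2 + 3 * x + rnorm(100)
dat <- data.frame(x, y)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))  # assign each row to a fold

cv_errors <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]                 # k-1 folds for training
  test  <- dat[folds == i, ]                 # held-out fold for testing
  fit   <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, test))^2)      # test MSE for this fold
})
mean(cv_errors)                              # cross-validated error estimate
```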
69
What is a general method to avoid overfitting?
Ensure that the data is STRICTLY independent of model building - This allows for independent estimates of model accuracy and directly compare multiple models
70
What is a general strategy of parameter optimization?
Regularization (or penalization)
- Don't optimize the fit to the data, optimize the combination of FIT and SIMPLICITY
71
In regards to model evaluation of optimizing parameters, how does regularization work?
2 steps:
1) Find a set of parameters that maximizes some objective function, which indicates how well the model fits the data
2) Incorporate a PENALTY function that assesses the importance of adding another parameter to the model
Famous models:
• Ridge regression: L2-norm --> sum of squares of the weights
• LASSO regression: L1-norm --> sum of absolute values (a sort of automatic feature selection)
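A minimal sketch assuming the glmnet package is installed; alpha = 0 gives ridge (L2) and alpha = 1 gives LASSO (L1), with cross-validation choosing the penalty strength:

```r
library(glmnet)                        # assumes glmnet is installed
set.seed(5)
x <- matrix(rnorm(100 * 20), ncol = 20)      # 20 candidate features
y <- x[, 1] - 2 * x[, 2] + rnorm(100)        # only two truly matter

ridge <- cv.glmnet(x, y, alpha = 0)   # L2 penalty: sum of squared weights
lasso <- cv.glmnet(x, y, alpha = 1)   # L1 penalty: shrinks some weights to 0

coef(lasso, s = "lambda.min")         # sparse coefficients = feature selection
```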
72
In regards to model evaluation an model fitting, what is a learning curve?
- As the size of the training data increases, generalization performance typically improves
- The visualization of generalization performance against the amount of training data is referred to as a learning curve
IMPORTANT (slide 197):
• The learning curve shows the performance on test data only, plotted against the size of the training data
• A fitting graph shows the generalization performance and the performance on the training data plotted against model complexity. Here the size of the training data is fixed
73
How is accuracy measured given a classifier for model evaluation?
Accuracy = 1 − error rate = # correct decisions / # total decisions
Problem: we can have different kinds of correct and incorrect decisions --> confusion matrix
74
How does a confusion matrix work?
If positive versus negative --> a 2 x 2 matrix, with observed classes in the columns (O(1) and O(0)) and predicted classes in the rows (P(1) and P(0))
- The main diagonal counts the correct predictions. However:
1) When we have O(1) and P(0), we have false negatives
2) When we have O(0) and P(1), we have false positives
75
Briefly explain how a confusion matrix is important in unbalanced scenarios
When the class distribution is skewed, we are in big trouble
- That is, the two FALSE cases can end up being skewed as well
- The anti-diagonal (number of false decisions) is not uniformly distributed across the matrix
- In this case you want to count the two kinds of false predictions separately, since they have different costs
76
How can we fix the problem of unbalanced scenarios in a confusion matrix?
We can use expected value to assign different probabilities and costs to each of the false scenarios, based on previous data --> that way the evaluation reflects how serious each kind of mistake/false prediction is
- Expected value can be used for two purposes:
1) Framing the usage of the classifier
2) Framing the evaluation of the classifier
77
What are some of the evaluation metrics expected value can use to frame the evaluation of a classifier?
EXAMPLE: targeted marketing for upselling
• A false positive (FP) occurs when we classify a consumer as a likely responder and therefore target her, but she does not respond
• A false negative (FN) is a consumer who was predicted not to be a likely responder (so was not offered the product), but would have bought it if offered
• A true positive (TP) is a consumer who is offered the product and buys it
• A true negative (TN) is a consumer who was not offered a deal and who would not have bought it even if it had been offered
- Most of the metrics used to evaluate a model are summaries of the confusion matrix
78
What are specificity and sensitivity in terms of a binary classification test?
Sensitivity = true positive rate (measures the proportion of true positive responses that are correctly identified)
TP / (TP + FN)
Specificity = true negative rate (measures the proportion of true negative responses that are correctly identified)
TN / (TN + FP)
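A small base-R sketch with made-up predictions, building the confusion matrix and computing accuracy, sensitivity and specificity from it:

```r
observed  <- c(1, 1, 0, 0, 1, 0, 1, 0, 0, 0)   # true classes (made up)
predicted <- c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0)   # classifier output

cm <- table(Predicted = predicted, Observed = observed)
cm

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

accuracy    <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```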
79
What is needed in R to make a plot?
1) data.frame | 2) OR a rectangular representation of the data
80
What is a coordinate system?
Definition: a set of position scales and their relative geometric arrangement
Two axes crossing each other, enabling a graphical representation of data based on two values.
NOTE: the axes do not have to be straight lines crossing at right angles; one could be a circle and the other run radially...
81
What is the most widely used coordinate system?
The 2d Cartesian coordinate system
82
What happens in a Cartesian coordinate system, when you change the units of your data?
As long as the transformation is linear, the Cartesian coordinate system is INVARIANT --> the graphical output will be the same, even if the figures/placing of gridlines are not identical
Example: Fahrenheit to Celsius (slide 236)
83
What are examples of non-linear coordinates, and when are they used?
When the distance between data points is not linear and you need to adjust the scale
Examples (slide 237):
- Logarithmic (log) transformation (the most common)
84
When is the log scale coordinate system preferred?
When multiplication/division is applied, since it becomes addition/subtraction on a log scale
85
What is a polar coordinate system? And when is it useful?
Like the scope of a rifle: a + with a circle around it. The cross is then 1, 2, 3, 4, and you add circles around it that correspond to different layers: 1, 2, 3, 4
Useful for data of a periodic nature --> e.g. temperature in a region varying each month --> shows the circular relationship --> the starting value joins the ending value
Slides 239-240
86
What is a geospatial coordinate system?
Based on maps
REMEMBER the Earth is a sphere, so do not use Cartesian coordinates; use something else:
- Lambert equal area or Transverse Mercator
87
What plots are useful representing amounts?
Bar plots
88
What plots are useful representing distributions?
Box plots, violin plots, stacked histograms, overlapping densities
89
What plots are useful representing proportions?
Mosaic plots, multiple pie charts, stacked bars, parallel sets
90
What plots are useful representing relationships?
Line-graphs, connected scatter plot, smooth line graphs, correlogram
91
What plots are useful representing uncertainty?
Error bars, confidence bands on regressions
92
What is the difference between structured and unstructured data?
Structured data is comprised of clearly defined data types whose pattern makes them easily searchable.
Unstructured data is everything else: data that is usually not as easily searchable, including formats like audio, video, social media postings and text.
93
What is natural language processing?
Definition: defined as the application of computational techniques to the analysis and synthesis of natural language and speech, by representing texts mathematically --> Can go from word-count to a neural network
94
What is the source called in NLP?
The corpus
95
What is the difference between computational linguistics and NLP?
Computational Linguistics is a more theoretical field that develops computational methods to answer scientific questions from the point of view of linguists
Natural Language Processing is dedicated to giving solutions to engineering problems related to natural language, focusing on the people
96
What is the typical workflow in an NLP setting?
Collection of docs --> pre-processing --> exploratory research --> representation of relevant features in a usable vector space --> apply it to a model
97
What are the 7 problems in NLP?
1) Ambiguity --> words can have several meanings depending on context
2) Synonymy --> we can express the same idea with different terms ("fine" as an example)
3) Syntax --> language structure, based on rules; you can reorder sentences
4) Coreference --> "the 'firm' is (...), 'it' is therefore"
5) Normalization versus information --> CONSULTing, CONSULTant; we lose some information by normalizing
6) Representation --> word embedding (man --> woman, king --> queen), the transformation of words into numeric vectors
7) Style --> sarcasm as an example, the way of saying things
98
In terms of textual ambiguity what does signifier and signified mean?
Signifier = the way we represent the information --> with a word Signified = mental concept, the meaning of that information Ambiguity then happens because of the difference between the two
99
What is RegEx?
A pattern-matching language (regular expressions) used to clean textual data of content we do not want in there. Start by removing the most evident unwanted material and replacing it with a space, and continue this process until you have a human-readable document.
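A minimal base-R sketch of this kind of regex cleaning with gsub(); the raw string and the patterns are just illustrative:

```r
raw <- "<p>Revenue grew 12%!!   Visit https://example.com for more…</p>"

txt <- gsub("<[^>]+>", " ", raw)       # strip HTML tags
txt <- gsub("http\\S+", " ", txt)      # strip URLs
txt <- gsub("[^a-zA-Z ]", " ", txt)    # keep letters and spaces only
txt <- gsub("\\s+", " ", trimws(txt))  # collapse repeated whitespace
txt                                    # "Revenue grew Visit for more"
```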
100
In the bag of words NLP technique, how does the corpus work?
The corpus consists of N documents (D), each of which consists of a number of words (w)
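A minimal base-R sketch of a bag-of-words (document-term matrix) representation for a toy two-document corpus:

```r
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog ate the cat")

tokens <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per word,
# each cell counting how often that word occurs in that document
dtm <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
dtm
```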
101
What are some of the limitations of the bag of words NLP technique?
Does not consider - grammar - word order - sentence structure - punctuation - any relationship between words
102
What is inverse document frequency? and what are the two considerations you need to take into account, when using it?
A measure of whether a given term is common or rare in a corpus. TWO considerations: - Should not be too rare --> not meaningful for a cluster - Should not be too common --> if too common, it doesn't distinguish anything
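A small sketch with a made-up document-term matrix, using the common idf formula idf(t) = log(N / df(t)); note how the term that appears in every document gets weight zero:

```r
# Term counts for 3 documents (made up); columns are terms
dtm <- rbind(d1 = c(data = 5, cat = 2, rare = 0),
             d2 = c(data = 4, cat = 0, rare = 1),
             d3 = c(data = 6, cat = 0, rare = 0))

tf    <- dtm / rowSums(dtm)        # term frequency within each document
df    <- colSums(dtm > 0)          # number of documents containing the term
idf   <- log(nrow(dtm) / df)       # ubiquitous terms ("data") get idf = 0
tfidf <- sweep(tf, 2, idf, `*`)    # weight down terms that are too common
round(tfidf, 3)
```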
103
In relation to NLP, what is keyness?
A measure associated with features that occur differentially across different categories - It gives the distinguishing features of the corpora Example with cat and dog being "keyness" words: • Corpus A: cat 52; dog 17; cow 31 • Corpus B: cat 9; dog 40; cow 31
104
In NLP, what is lexical dispersion?
Visualization of the occurrences of a term in context
We want an informative measure which communicates where the term has been used in the text (President example: who said "future" and when in their speeches) => x-ray plot
105
Two methods of processing the text in NLP?
1) Lemmatization: "better" and "good" are in the same lemma --> a process by which inflected forms of words are grouped together to be analyzed as a single aspect
2) Stemming: similar, but instead you break each word down to its stem/base root form
106
What do we use Syntactic Parsing for in NLP?
- involves the analysis of words in the sentence for grammar and their arrangement in a manner that shows the relationships among the words Important attributes of text syntactics: - Dependency grammar - Part of speech tags
107
What are the three ways we can achieve syntactic parsing?
- Part-of-Speech tagging (POS) (noun, verb, adjective)
- Dependency parsing (analyzes the grammar of a sentence, linking head words with the words which modify them)
- Named Entity Recognition (NER) (identifies persons, names/organizations)
108
What is meant by tone/sentiment and how can we measure it?
Positive versus negative tone --> used on KNOWN data categories
Two ways to model it:
- Lexicon based (pure word counting, more sophisticated)
- Deep Learning
109
How does the lexicon approach work in terms of computing the narrative tone of a text?
- One relies on hash tables, containing categories of words, like positive and negative - Then by matching these lexica with the corpus, you get an indication of negative versus positive tone
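A minimal base-R sketch of the lexicon approach, with a toy (made-up) lexicon and an already tokenized document:

```r
positive <- c("good", "great", "profit", "growth")   # toy positive lexicon
negative <- c("bad", "loss", "decline", "risk")      # toy negative lexicon

doc <- c("profit", "growth", "was", "good", "despite", "risk")  # tokens

n_pos <- sum(doc %in% positive)
n_neg <- sum(doc %in% negative)
tone  <- (n_pos - n_neg) / length(doc)   # crude positive-minus-negative score
tone                                      # > 0 indicates an overall positive tone
```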
110
What is topic modeling?
- Detection of hidden patterns or topics in the corpus
Three models:
- LSA
- Latent Dirichlet Allocation (LDA)
- CTM, etc.
111
How does an LDA model work?
- generative statistical model which allows us to explain observations through unobserved characteristics - The intuition is that each document is viewed as a mixture of a given set of topics - Each topic is then viewed as a mixture of a set of words
112
In the LDA, how does one establish how many topics are optimal?
NOT a rigorous way: PERPLEXITY, detecting the optimal model by computing incremental power (look at the graphs on slide 314)
A MORE rigorous way: defining a chi-square test
113
How is AI, ML and DL related?
AI is the overall circle, then inside AI, we have ML. And inside ML we have DL.
114
Definition of AI
“The effort to automate intellectual tasks normally performed by humans”
115
What is symbolic AI?
Reach human-level AI by coding a sufficiently large set of explicit rules for manipulating knowledge
116
How can we think of the change of the paradigm from symbolic AI to Machine Learning (ML)?
Classical programming: - INPUTS: Rules + data - OUTPUT: Answer Machine learning: - INPUTS: Data + Answers (from classical) - Output: rules (that feed in again)
117
How do we characterize an ML system and how is different from statistics?
- It is trained rather than programmed (it relies on a bunch of examples, from which it derives rules)
- Unlike statistics, ML is able to manage large and complex datasets for which classical statistical approaches, such as Bayesian analysis, would be impractical
- As a result, ML, and especially deep learning, exhibits comparatively little mathematical theory and is engineering oriented
- It's a hands-on discipline in which ideas are proven empirically more often than theoretically
118
What three things are needed to do ML? And how does this help defining ML?
1) Input data points: e.g. images 2) Examples of the expected output: e.g. images of cats and dogs 3) A way to measure the performance of the algorithm: LEARNING through FEEDBACK ML searches for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal
119
What is the overarching central problem in ML?
To meaningfully transform data, to learn useful representations of the input data
120
What is deep learning, and what does "deep" refer to? How does this related to a Neural Network?
- Subfield of ML, where the learning process is taken even further to successive layers of increasingly meaningful representations DEEP/DEPTH = # of layers that contribute to a model In DL the layered representation is almost always learned through models called NEURAL NETWORKS
121
What is the definition of a Neural Network?
“...a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs."
122
Explain what is meant by a neuron in a neural network, and how each individual neuron works
Neuron is basically a node, which can be inside either a hidden layer or an output layer Works: Input from axon from a neuron --> Synapse (weight) --> carried by the dendrite (w*x) to --> cell body, which summarizes all inputs and adds a bias --> then shot out into activation function --> through output axon
123
What is a layer and what is it constituted of?
Layers can be either input or hidden or output --> the more layers the deeper the learning Layers are made out of interconnected nodes/neurons
124
How do we determine how many layers there are inside a NNET?
Input layer (DOES NOT COUNT) Hidden layers (just count amount of hidden layers) Output layer (just one?)
125
How do we find # of neurons, # of biases, # of weights and # of parameters in a NNET?
Neurons: count the number of neurons present in each hidden layer and the output layer
Biases: ONE bias per neuron
Weights: depends on how many inputs there are into each neuron; # of inputs * # of neurons = # of weights (per layer)
Parameters: biases + weights
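A worked example in R for a hypothetical network with 3 inputs, two hidden layers (4 and 2 neurons) and 1 output neuron:

```r
inputs <- 3            # input layer (not counted as a layer)
hidden <- c(4, 2)      # two hidden layers with 4 and 2 neurons
output <- 1            # one output neuron

layers  <- c(hidden, output)
neurons <- sum(layers)                      # 4 + 2 + 1 = 7
biases  <- neurons                          # one bias per neuron
weights <- sum(c(inputs, hidden) * layers)  # 3*4 + 4*2 + 2*1 = 22
params  <- weights + biases                 # 29 parameters in total
c(neurons = neurons, biases = biases, weights = weights, params = params)
```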
126
How does DL and NNET work in terms of image recognition?
The more layers, the more filtered, the more different each layer becomes from the original photo, but also more informative about the output DL is considered a multistage way to learn data representations
127
How does parametrization and evaluation work in a NNET?
An NNET is parametrized by its weights --> the goal is then to find the correct value of these
128
What is the loss function in a NNET?
A function that controls the output of a neural network by measuring how far the output is from what we expect Function that compares predictions with true targets
129
What is backpropagation in NNET?
The use of the DL loss score acquired from the loss function, which is shot into an OPTIMIZER that implements the feedback mechanism called backpropagation - This makes adjustments by updating the weights, i.e. "training" the network
130
What is considered the gears of a NNET? and how do they work?
Tensors or tensor operations --> generalizations of the vectors and matrices we used to play with in linear algebra
- They are just geometric transformations of the input data
131
What is the overall purpose of the Activation Function in a NNET?
To check the output variable Y of the neuron, and decide whether other connections should consider this neuron as activated or not
132
What 5 Activation Functions have we been discussing?
1) Step function (1 or 0, based on a threshold)
2) Linear function (continuous line, providing a range of activations)
3) Sigmoid function (a smooth, step-like function with output between 0 and 1)
4) Tanh function (sister to the Sigmoid, range from -1 to 1)
5) ReLU (Rectified Linear Unit): a non-linear function that looks like a linear one --> often the preferred choice
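A base-R sketch defining the five activation functions for comparison (the step threshold of 0 is an illustrative choice):

```r
step    <- function(x) ifelse(x > 0, 1, 0)   # binary threshold
linear  <- function(x) x                     # identity / linear
sigmoid <- function(x) 1 / (1 + exp(-x))     # smooth, output in (0, 1)
tanh_af <- function(x) tanh(x)               # output in (-1, 1)
relu    <- function(x) pmax(0, x)            # 0 for x < 0, x otherwise

curve(sigmoid, -5, 5, ylab = "activation")   # visualise two of them
curve(relu, -5, 5, add = TRUE, lty = 2)
```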
133
NNET: Pros and cons of using a STEP activation function?
- Easy - Binary activation (either 0 or 1) - Does NOT support multiple neurons
134
NNET: Pros and cons of using a LINEAR activation function?
- Support multiple neurons - Gradient is constant (no change in variations because of linearity) - If NN is with many layers, the activation function in the last layer is just a linear combination of the first one (low complexity)
135
NNET: Pros and cons of using a SIGMOID activation function?
- Non-linear combinations
- Support for multiple activations
- Output range is from 0 to 1 (like logistic, therefore GOOD for probability)
- Offers VERY SMALL GRADIENTS at the boundaries
136
NNET: What is GRADIENTS within the activation function?
Gradients measure changes in variation --> the smoother the curve, the lower the gradients
137
NNET: Pros and cons of using a TANH activation function?
- Output range can be -1 to 1 - Stronger gradients than Sigmoid, which is not good - Small gradients at boundaries though
138
NNET: Pros and cons of using a ReLU activation function?
- Only gives an output if x is positive, and 0 otherwise
- A bit like a stepwise linear function, but should be thought of as a combination of a linear and a non-linear function
- NO upper boundary, as compared to other logistic-like functions
- Enables sparse activations (i.e. many neurons do not activate)
- For x < 0 the gradient does not exist, so there is no response from those neurons
- It IS LESS COMPUTATIONALLY EXPENSIVE than Tanh and Sigmoid, which makes it very PREFERABLE
139
How does one choose an activation function?
No predefined rules: - If you know the function you are trying to replicate, maybe use a.f. similar to that one - Sigmoid are good for classifiers - When NO PRIOR KNOWLEDGE --> RELU is good
140
Why do we need time discretization in NNETs?
t-1, t, t+1 | Fundamental since it significantly simplifies the mathematical implementation of NNETs
141
Describe the role of the bias in Neural Networks
- The bias node is ALWAYS on - Can be thought of as the INTERCEPT in a regression model - If a NNET does not have a bias node in a given layer, it will not be able to produce an output in the next layer that differs from 0 in a linear scale (or any transformation of 0 when passed to the a.f.)
142
When does a NNET consist of "dense layers"?
When ALL layers and nodes are connected
143
Name the THREE types of NNETs
1) Feedforward Neural Networks --> information moves ONLY in one direction (no cycles)
Can be either:
- Single-Layer Perceptron
- Multi-Layer Perceptron (MLP)
2) Convolutional Neural Networks
3) Recurrent Neural Networks --> have internal memory and cycling ability
144
NNET; how does the single-layer perceptron Feedforward network work?
ONE output layer that receives inputs and provides an output (input layer does not count)
145
NNET; How does the multiple-layer perceptron Feedforward network work?
- Consists of multiple layers of computational units - Each neuron is then connected to each other in an Feedforward manner - Sigmoid is typically used as a.f. - They can learn non-linear representations
146
NNET; How does the Convolutional Neural Network (CNN) work?
- Evolution of the MLP
- Only a limited region (receptive area) responds to the stimuli and then activates
- Requires minimum preprocessing
- Very widely used in image recognition
147
What are the three types of layers used in a Convolutional Neural Network (CNN)?
1) Convolutional layer --> takes two signals and produces a third, extracting features of e.g. an image
2) Pooling layers --> placed in between convolutional layers to summarize
- This layer reduces computational costs by reducing the number of weights (parameters) by up to 75%
- Also controls for overfitting and enables generalization
3) Dropout layers --> chop off irrelevant connections to avoid unnecessary computations and overfitting
- Done by randomly setting activations to 0, but only in the learning stage on the training set
- Set to 0 because the network's goal is to be redundant, so that it can provide the right classification even without all layers activated
148
What is a Recurrent Neural Network?
- Neural Network with feedback loops
- Involves backpropagation
- Applied to sequential tasks like handwriting and speech recognition
149
What is web scraping?
Web scraping is a technique for converting the data present in unstructured format (HTML tags) over the web to the structured format which can easily be accessed and used. This means that you are going to build a data.frame or a corpus at the end of the day
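A hedged sketch assuming the rvest package is installed; the URL and CSS selectors are placeholders that would have to be adapted to the actual page being scraped:

```r
library(rvest)                          # assumes rvest is installed

url  <- "https://example.com/articles"  # placeholder URL
page <- read_html(url)                  # fetch and parse the HTML

# CSS selectors below are hypothetical; inspect the real page to find them
titles <- html_text2(html_elements(page, "h2.title"))
dates  <- html_text2(html_elements(page, "span.date"))

articles <- data.frame(title = titles, date = dates)  # structured data.frame
```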
150
What is the vanishing gradient problem? And in which activation functions is it present?
- The problem of vanishing gradients arises due to the nature of backpropagation optimization
- Gradients tend to get smaller and smaller as we keep moving backward through the network
- This implies that neurons in earlier layers learn very slowly compared to neurons in the last layers
- The vanishing gradient problem results in a decrease in the prediction accuracy of the model and a long training time
- It is mainly present in activation functions with very small gradients at the boundaries, such as Sigmoid and Tanh