Multivariate Statistik Flashcards

1
Q

correlation vs regression

A

Correlation
• description of an undirected relationship between two or more variables
• measures how strong the relationship is
• the direction is unknown, non-existent, or we are simply not interested
• example: phones per household and infant deaths

Regression
• description of a directed relationship between two or more variables
• one variable influences the other
• examples: smoking and cancer, weight and height
• model to describe the relationship
• model to predict one variable
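The contrast above can be made concrete in a few lines of plain Python (toy data, no libraries): the correlation coefficient is symmetric in x and y, while the regression line treats y as the outcome predicted from x.

```python
# Minimal sketch: Pearson correlation (undirected strength) vs. a
# least-squares regression line y = a*x + b (directed model).
# Data are made up for illustration.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]  # roughly y = 2x

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)   # correlation: symmetric in x and y
a = sxy / sxx               # regression slope: direction matters (y from x)
b = my - a * mx             # intercept
```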
2
Q

The Coefficients

A
  • How many variables to include?
  • Akaike's "An Information Criterion" (AIC)
  • we stop including variables when the AIC can no longer be decreased or R² can no longer be increased
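A small sketch of the stopping rule, in plain Python. For a least-squares fit, AIC can be computed (up to an additive constant) as n·ln(RSS/n) + 2k, where k is the number of fitted parameters; a variable is kept only if adding it lowers the AIC. The data and model comparison here are made up for illustration.

```python
# AIC-guided variable inclusion: compare an intercept-only model against
# a model with one predictor; include the predictor only if AIC drops.
from math import log

def rss_of_fit(x, y):
    """Residual sum of squares of a simple least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    icpt = my - slope * mx
    return sum((b - (slope * a + icpt)) ** 2 for a, b in zip(x, y))

def aic(rss, n, k):
    # Gaussian least-squares AIC up to an additive constant
    return n * log(rss / n) + 2 * k

y  = [2.0, 4.1, 5.9, 8.2, 10.1, 11.9]
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]       # informative predictor

n = len(y)
rss0 = sum((v - sum(y) / n) ** 2 for v in y)  # intercept only, k = 1
aic0 = aic(rss0, n, 1)
aic1 = aic(rss_of_fit(x1, y), n, 2)           # intercept + slope, k = 2

include_x1 = aic1 < aic0   # stop rule: include only if AIC decreases
```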
3
Q

Classification / Regression Trees

A
  • for an intermediate number of variables
  • classification tree: predicts categories (classes)
  • regression tree: predicts numerical values
  • such trees are also called decision trees
  • any type of predictor variable can be used
  • no linear relationships required
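The core mechanic of a regression tree can be sketched by hand (in practice one would use a library such as R's rpart; this is a one-split illustration on toy data): try every threshold on the predictor and keep the split that minimises the summed squared error of the two leaf means.

```python
# One split of a regression tree: pick the threshold that minimises the
# total squared error around the two resulting leaf means.

def sse(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, left_mean, right_mean) of the best single split."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left  = [v for u, v in pairs if u < thr]
        right = [v for u, v in pairs if u >= thr]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, thr, sum(left) / len(left), sum(right) / len(right))
    return best[1:]

# y jumps from ~1 to ~9 between x = 3 and x = 4
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 9.1, 8.8, 9.0]
thr, left_mean, right_mean = best_split(x, y)
```

A full tree applies the same search recursively inside each leaf; note that no linear relationship between x and y is assumed anywhere.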
4
Q

Good Cluster

A

High-quality clusters have

  • high intra-class similarity
  • low inter-class similarity

This depends on the distance measure and the clustering method:
Good: small circles (compact clusters), long lines (well-separated clusters)
Bad: big circles, short lines
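The "small circles, long lines" picture can be checked numerically on toy 1-D data: a good clustering has a small mean intra-cluster distance and a large mean inter-cluster distance.

```python
# Compare mean intra-cluster distance ("circle size") with mean
# inter-cluster distance ("line length") for two toy clusters.

def mean(vals):
    return sum(vals) / len(vals)

cluster_a = [1.0, 1.2, 0.8]
cluster_b = [9.0, 9.3, 8.7]

intra_a = mean([abs(p - q) for i, p in enumerate(cluster_a)
                           for q in cluster_a[i + 1:]])
intra_b = mean([abs(p - q) for i, p in enumerate(cluster_b)
                           for q in cluster_b[i + 1:]])
inter   = mean([abs(p - q) for p in cluster_a for q in cluster_b])

good_clustering = max(intra_a, intra_b) < inter  # small circles, long lines
```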

5
Q

Similarity and distance: Variable: binary

A

Matching coefficient
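The card only names the coefficient; a common form for binary vectors is the simple matching coefficient, the fraction of positions where the two vectors agree (both 1s and both 0s count as matches).

```python
# Simple matching coefficient for two binary vectors:
# matches (1-1 and 0-0) divided by total positions.

def simple_matching(u, v):
    assert len(u) == len(v)
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)

s = simple_matching([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])  # 3 of 5 positions agree
```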

6
Q

Similarity and distance: Variable: categorical

A

Jaccard coefficient:

S_ij = a / (a + b + c)

(the Jaccard distance is 1 - S_ij)
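In the formula, a counts joint presences (both 1), b and c count the mismatches, and joint absences are ignored; note this definition operates on binary presence/absence vectors even though the card files it under "categorical". A plain-Python sketch:

```python
# Jaccard similarity S_ij = a/(a+b+c) for binary vectors:
# a = both 1, b = 1 in u only, c = 1 in v only; 0-0 pairs are ignored.

def jaccard(u, v):
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    return a / (a + b + c)

s = jaccard([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])  # a=2, b=1, c=1
d = 1 - s                                       # Jaccard distance
```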

7
Q

Types of clustering: hierarchical vs partitional

A

Hierarchical Clustering: A set of nested clusters
organized as a hierarchical tree –> we will get a dendrogram and a cluster id by dendrogram cutting

Partitional Clustering: A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset –> we will only get a cluster id

8
Q

steps in hierarchical clustering

A
  • no need to specify the number of clusters k before clustering starts
  • the algorithm constructs a tree-like hierarchy (dendrogram) which (implicitly) contains all values of k
  • at one end of the tree there are n clusters, each containing a single object; at the other end there is one cluster containing all n objects
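The steps above can be sketched in plain Python (the course presumably uses R's hclust; this is a hand-rolled single-linkage illustration on toy 1-D data): start from singleton clusters and repeatedly merge the two closest ones. Stopping at k clusters corresponds to cutting the dendrogram at some height.

```python
# Agglomerative hierarchical clustering, single linkage, 1-D toy data.

def single_link(c1, c2):
    return min(abs(p - q) for p in c1 for q in c2)

def agglomerate(points, k):
    clusters = [[p] for p in points]     # one end: n singleton clusters
    while len(clusters) > k:             # k = 1 end: one big cluster
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

two = agglomerate([1.0, 1.1, 5.0, 5.2], 2)   # cut at k = 2
one = agglomerate([1.0, 1.1, 5.0, 5.2], 1)   # cut at k = 1
```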
9
Q

Hard vs. soft (fuzzy) clustering

A

• Hard clustering algorithms:
- assign each pattern to a single cluster during operation and output
- hclust, diana, kmeans
• Fuzzy clustering algorithms:
- assign degrees of membership in several groups
- fanny
- fanny membership sub-object: soft clustering results
- fanny clustering sub-object: hard clustering results
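The soft/hard distinction can be illustrated with the standard fuzzy c-means membership formula (a generic sketch, not fanny's exact algorithm): given fixed cluster centres, the membership of point x in cluster i is 1 / Σ_k (d(x,c_i)/d(x,c_k))^(2/(m-1)), and hardening means taking the argmax.

```python
# Soft memberships for fixed centres (fuzzy c-means formula), then the
# hardened cluster id via argmax, mirroring fanny's membership vs
# clustering sub-objects.

def memberships(x, centres, m=2.0):
    d = [abs(x - c) for c in centres]
    if 0.0 in d:                      # point sits exactly on a centre
        return [1.0 if di == 0 else 0.0 for di in d]
    return [1.0 / sum((d[i] / d[k]) ** (2 / (m - 1)) for k in range(len(d)))
            for i in range(len(d))]

centres = [0.0, 10.0]
soft = memberships(2.0, centres)      # degrees of membership (sum to 1)
hard = soft.index(max(soft))          # hard result: closest cluster wins
```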

10
Q

K-Means

A
  • the K cluster centres are chosen at random (decide beforehand how many cluster centres you want)
  • each data point is assigned to the nearest cluster centre
  • each cluster centre is recomputed as the mean of its entire cluster
    o if this changes the assignment of individual points → reassign them
    ➔ repeat until nothing changes any more
  • problem: since the starting points are chosen at random, the resulting partition can depend strongly on them
    ➔ run the algorithm many times and estimate K well
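The loop described above can be written out directly in plain Python (1-D toy data, a minimal sketch rather than a production implementation):

```python
# K-means: random start centres, assign to nearest centre, recompute
# centres as cluster means, repeat until the assignment stabilises.
import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)            # random initial centres
    assignment = None
    while True:
        new = [min(range(k), key=lambda i: abs(p - centres[i]))
               for p in points]
        if new == assignment:                  # nothing changed: done
            return centres, new
        assignment = new
        for i in range(k):                     # recompute centre as mean
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centres[i] = sum(members) / len(members)

points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
centres, labels = kmeans(points, 2)
```

Rerunning with different seeds is exactly the "run the algorithm many times" advice: on awkward data, different starts can converge to different partitions.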
11
Q

there are three ways to define the distance to a cluster (or within a cluster) - which are they?

A

average linkage, single linkage, complete linkage
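All three criteria reduce to a different summary of the same set of pairwise distances between two clusters, which a few lines of plain Python make explicit:

```python
# Single, complete, and average linkage between two toy 1-D clusters:
# minimum, maximum, and mean of all pairwise distances respectively.

c1 = [1.0, 2.0]
c2 = [5.0, 8.0]

dists = [abs(p - q) for p in c1 for q in c2]   # all pairwise distances

single   = min(dists)            # smallest pairwise distance
complete = max(dists)            # largest pairwise distance
average  = sum(dists) / len(dists)
```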

12
Q

average linkage:

A
  • use the average distance value

–> average linkage to merge closest rows

13
Q

complete linkage

A

o the distance between clusters is the distance to the farthest (i.e. most distant) point of the first cluster
o problem: large clusters rarely gain new members

14
Q

single linkage

A

  • use the smallest distance value

–> single linkage to merge closest rows

15
Q

PCA (general)

A

• Mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables (PCs).
• The first PC accounts for as much of the variability in the data as possible; each succeeding component accounts for as much of the remaining variability as possible.
• PCA is performed on a covariance or a correlation matrix of your data.
• Use the correlation matrix if the variances of your variables differ largely (scale=TRUE).
• Principal components (PCs) are linear combinations of the original variables, weighted by their contribution to explaining the variance along a new orthogonal dimension.

16
Q

PCA’s usage

A
  • tool for exploratory data analysis
  • recognize patterns, outliers, trends, groups
  • for gene expression studies it finds the genes that contribute most to the difference between the groups
  • reduces dimensionality
  • variables (genes) contribute with certain weights (loadings) to the principal components
17
Q

understanding PCA geometrically and via the covariance matrix/eigenvectors

A

• Geometrically
- Rotation of space to maximize variance for fewer coordinates
• Covariance matrix and eigenvector
- Eigenvector with largest eigenvalue is first principal component (PC)
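The eigenvector statement can be verified by hand for 2-D data, since a symmetric 2×2 covariance matrix has closed-form eigenvalues; no linear-algebra library is needed (toy data, a sketch rather than a general PCA routine):

```python
# PCA by hand in 2-D: build the covariance matrix, take the eigenvector
# belonging to the largest eigenvalue -> first principal component.
from math import sqrt, hypot

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.0, 2.9, 4.2, 5.1]          # almost y = x, so PC1 ~ (1,1)/sqrt(2)

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cxx = sum((a - mx) ** 2 for a in x) / (n - 1)
cyy = sum((b - my) ** 2 for b in y) / (n - 1)
cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# eigenvalues of the symmetric matrix [[cxx, cxy], [cxy, cyy]]
mean_t = (cxx + cyy) / 2
root = sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
lam1 = mean_t + root                   # largest eigenvalue = variance of PC1

# eigenvector for lam1 (valid while cxy != 0), normalised to length 1
v = (cxy, lam1 - cxx)
norm = hypot(*v)
pc1 = (v[0] / norm, v[1] / norm)
```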

18
Q

simple linear regression

A
  • the line through the data must be fitted so that the scatter of the surrounding data points (the residuals) is minimal
    ➔ fitting
  • only applicable to evenly scattered, linear data (putting a straight line through a curve makes no sense…)
  • follows the maximum-likelihood principle: the probability that my model generated my data should be maximal

y = ax + b
(dependent variable = regression coefficient × independent variable + intercept)
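The fitting criterion can be demonstrated directly: the least-squares line y = ax + b has the smallest residual sum of squares, and any perturbed line scores worse (toy data, plain Python sketch):

```python
# Least-squares fit of y = a*x + b, and a check that perturbing the
# coefficients increases the residual sum of squares (the "scatter").

def rss(x, y, a, b):
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 3.8, 6.1, 8.0, 9.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
a = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))
b = my - a * mx                          # fitted line y = a*x + b

fitted = rss(x, y, a, b)
worse  = rss(x, y, a + 0.3, b)           # any other line has larger RSS
```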

19
Q

multiple linear regression

A

• multiple regression (multiple predictor variables P, Q, R) but one outcome
• multiple coefficients:

Y = a + bP + cQ + dR
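The coefficients of such a model can be found by solving the normal equations (XᵀX)β = Xᵀy; a plain-Python sketch with two predictors and a tiny Gaussian-elimination solver (toy data generated from a = 1, b = 2, c = 3, so the fit should recover those values exactly):

```python
# Multiple linear regression Y = a + b*P + c*Q via the normal equations.

def solve(A, v):
    """Solve A x = v by Gauss-Jordan elimination (A: list of rows)."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

P = [1.0, 2.0, 3.0, 4.0, 5.0]
Q = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [1 + 2 * p + 3 * q for p, q in zip(P, Q)]   # exact plane, no noise

X = [[1.0, p, q] for p, q in zip(P, Q)]         # design matrix with intercept
XtX = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(3)]
       for i in range(3)]
Xty = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(3)]
a, b, c = solve(XtX, Xty)                       # coefficients of Y = a+bP+cQ
```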