Multivariate Statistics Flashcards
correlation vs regression
Correlation
• description of an undirected relationship between two or more variables
• how strong it is
• direction is unknown, does not exist, or is simply not of interest
• example: number of phones per household and infant deaths
Regression
• description of a directed relationship between two or more variables
• one variable influences the other
• smoking and cancer
• weight and height
• model to describe the relationship
• model to predict one variable
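The contrast between the two cards can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the height and weight values are hypothetical toy data, and the helper names `pearson_r` and `ols_slope_intercept` are made up for this example. It shows that the correlation is the same in both directions, while the regression line depends on which variable is treated as the predictor.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient; symmetric in x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ols_slope_intercept(x, y):
    """Least-squares line y = intercept + slope * x (x is the predictor)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical toy data: height in cm, weight in kg.
height = [160.0, 165.0, 170.0, 175.0, 180.0]
weight = [55.0, 62.0, 66.0, 74.0, 78.0]

# Correlation: undirected, the order of the arguments does not matter.
r_hw = pearson_r(height, weight)
r_wh = pearson_r(weight, height)

# Regression: directed, swapping the roles gives a different line.
slope_w_on_h, intercept_w_on_h = ols_slope_intercept(height, weight)
slope_h_on_w, intercept_h_on_w = ols_slope_intercept(weight, height)

# The fitted model can be used to predict one variable from the other.
predicted_weight = intercept_w_on_h + slope_w_on_h * 172.0
```

The product of the two slopes equals the squared correlation, which is one way to see that the regression of weight on height is not simply the inverse of the regression of height on weight.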
The Coefficients
- How many variables to include?
- Akaike’s “An Information Criterion” (AIC)
- we stop adding variables once AIC can no longer be decreased or R^2 can no longer be increased
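The AIC stopping rule can be sketched for a single candidate variable. The example below assumes a least-squares fit with Gaussian errors, for which AIC is commonly written as n * ln(RSS / n) + 2k up to an additive constant (k counts the estimated coefficients here; conventions differ on whether the error variance is counted too, which does not change the comparison). The data and the helper name `gaussian_aic` are hypothetical.

```python
from math import log


def gaussian_aic(rss, n, k):
    """AIC of a least-squares fit with Gaussian errors, up to an additive constant."""
    return n * log(rss / n) + 2 * k


# Hypothetical toy data with a clear linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Model 0: intercept only (one coefficient).
rss0 = sum((b - my) ** 2 for b in y)

# Model 1: intercept plus slope for x (two coefficients).
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
slope = sxy / sxx
intercept = my - slope * mx
rss1 = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

aic0 = gaussian_aic(rss0, n, k=1)
aic1 = gaussian_aic(rss1, n, k=2)

# Include x only if doing so lowers the AIC.
include_x = aic1 < aic0
```

In a forward selection with several candidate variables, the same comparison would be repeated for each candidate, and inclusion would stop as soon as no candidate lowers the AIC.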
Classification / Regression Trees
- for an intermediate number of variables
- classification tree: predicts categories (classes)
- regression tree: predicts numerical values
- such trees are also called decision trees
- any type of predictor variables can be used
- no linear relationships required
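As a rough sketch of how a regression tree handles a non-linear relationship, the code below finds a single split (a tree with one decision node) that minimises the summed squared error of the two resulting groups. A full tree would apply the same search recursively to each group. The data and the function names `sse` and `best_split` are hypothetical, and real tree implementations add stopping rules and pruning that are not shown here.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) of the best rule 'x <= threshold'."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data: y jumps between two levels, so a straight line fits poorly.
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]

threshold, total_sse = best_split(x, y)

# Each leaf predicts the mean of the training values that fall into it.
left_prediction = sum(v for a, v in zip(x, y) if a <= threshold) / 3
```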
Good Cluster
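A high-quality clustering has
- high intra-class similarity
- low inter-class similarity
Depends on the distance measure and the clustering method:
Good: small circles, long lines (compact clusters that lie far apart)
Bad: big circles, short lines (spread-out clusters that lie close together)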
Similarity and distance: Variable: binary
Matching coefficient
Similarity and distance: Variable: categorical
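Jaccard distance
S_ij = a / (a + b + c)
(a: attributes present in both objects i and j; b, c: attributes present in only one of them; S_ij is the Jaccard similarity, the distance is 1 - S_ij)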
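Both coefficients can be computed from the same four counts. The sketch below assumes the usual convention for two 0/1 vectors: a counts positions where both are 1, b and c count positions where only one of them is 1, and d counts positions where both are 0. The vectors and function names are hypothetical.

```python
def contingency_counts(u, v):
    """Return (a, b, c, d) for two equally long 0/1 vectors."""
    a = sum(1 for p, q in zip(u, v) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(u, v) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(u, v) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(u, v) if p == 0 and q == 0)
    return a, b, c, d


def simple_matching(u, v):
    """Matching coefficient: share of positions where u and v agree."""
    a, b, c, d = contingency_counts(u, v)
    return (a + d) / (a + b + c + d)


def jaccard_similarity(u, v):
    """Jaccard coefficient a / (a + b + c); joint absences (d) are ignored.

    Assumes at least one of the vectors contains a 1.
    """
    a, b, c, _ = contingency_counts(u, v)
    return a / (a + b + c)


# Hypothetical objects described by eight binary attributes.
obj_i = [1, 1, 0, 0, 1, 0, 0, 0]
obj_j = [1, 0, 0, 0, 1, 1, 0, 0]

smc = simple_matching(obj_i, obj_j)          # (2 + 4) / 8
jac = jaccard_similarity(obj_i, obj_j)       # 2 / (2 + 1 + 1)
jaccard_distance = 1 - jac
```

The two values differ because the matching coefficient counts shared absences as agreement, while the Jaccard coefficient does not.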
Types of clustering: hierarchical vs partitional
Hierarchical Clustering: A set of nested clusters organized as a hierarchical tree -> we get a dendrogram, and a cluster id by cutting the dendrogram
Partitional Clustering: A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset -> we only get a cluster id
steps in hierarchical clustering
- no need to specify the number of clusters k before clustering starts
- the algorithm constructs a tree-like hierarchy (dendrogram) which (implicitly) contains all values of k
- at one end of the tree there are n clusters, each with one object, and at the other end there is one cluster containing all n objects
Hard vs. soft (fuzzy) clustering
• Hard clustering algorithms:
- assign each pattern to a single cluster during operation and output
- hclust, diana, kmeans
• Fuzzy clustering algorithms:
- assign degrees of membership in several groups
- fanny
- fanny membership sub-object: soft clustering results
- fanny clustering sub-object: hard clustering results
K-Means
- The K cluster centers are chosen at random (decide beforehand how many cluster centers you want)
- Each element (data point) is assigned to the nearest cluster center
- Each cluster center is recomputed as the mean of all points in its cluster
o If this changes the assignment of any data point → reassign
➔ Repeat until nothing changes anymore
- Problem: because the starting centers are placed at random, the result can depend strongly on the starting points
➔ The algorithm has to be run many times, and K has to be estimated well
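The loop described on this card can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation behind any particular library: the points are hypothetical, the function names are made up, the starting centers are drawn from the data points themselves, and an empty cluster simply keeps its old center. Several random starts are run and the one with the smallest within-cluster sum of squares is kept, which addresses the dependence on the starting points.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of its members (keep it if it has none)."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


def kmeans(points, k, rng):
    centers = rng.sample(points, k)        # random start
    labels = assign(points, centers)
    while True:
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:           # nothing changed: converged
            return centers, labels
        labels = new_labels


def within_ss(points, centers, labels):
    return sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

rng = random.Random(0)                     # fixed seed for reproducibility
runs = [kmeans(points, 2, rng) for _ in range(10)]
best_centers, best_labels = min(runs, key=lambda run: within_ss(points, run[0], run[1]))
```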
There are three ways to define the distance to a cluster (or within a cluster). Which are they?
average linkage, single linkage, complete linkage
average linkage:
- use the average distance value
-> merge the closest rows using average linkage
complete linkage
- use the largest distance value, i.e. the distance to the farthest point of the other cluster
- Problem: large clusters rarely take in new members
single linkage
- use the smallest distance value
-> merge the closest rows using single linkage
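The three linkage rules differ only in how they summarise the pairwise distances between the members of two clusters. The following sketch uses hypothetical points and function names, and Euclidean distance via `math.dist` (Python 3.8 or later).

```python
from itertools import product
from math import dist


def pairwise_distances(cluster_a, cluster_b):
    """All distances between one point of cluster_a and one of cluster_b."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))    # closest pair


def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))    # farthest pair


def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)                  # mean over all pairs


# Hypothetical clusters on a line, so the distances are easy to check by hand.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

d_single = single_linkage(cluster_a, cluster_b)      # 3.0
d_complete = complete_linkage(cluster_a, cluster_b)  # 6.0
d_average = average_linkage(cluster_a, cluster_b)    # (4 + 6 + 3 + 5) / 4 = 4.5
```

In agglomerative clustering, the pair of clusters with the smallest linkage value is merged at each step, so the choice of rule changes which clusters are joined first.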
PCA (general)
• Mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables (PCs).
• The first PC accounts for as much of the variability in the data as possible, each succeeding component accounts for as much of the remaining variability as possible.
• PCA is performed on a covariance or a correlation matrix of your data.
• Use the correlation matrix if the variances of your variables differ largely (scale=TRUE).
• Principal components (PCs) are linear combinations of the original variables, weighted by their contribution to explaining the variance in a particular orthogonal dimension.
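Why the correlation matrix is preferred when variances differ largely can be seen in a two-variable sketch. For a symmetric 2x2 matrix the eigenvalues have a closed form, so no linear algebra library is needed. The data and function names are hypothetical, and the loading formula assumes the two variables are correlated (non-zero off-diagonal entry).

```python
from math import sqrt


def covariance_matrix(x, y):
    """Return (var_x, cov_xy, var_y) using the n - 1 denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return sxx, sxy, syy


def eigenvalues_2x2(sxx, sxy, syy):
    """Eigenvalues of [[sxx, sxy], [sxy, syy]], largest first."""
    mid = (sxx + syy) / 2
    half_gap = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return mid + half_gap, mid - half_gap


def first_pc_loadings(sxx, sxy, syy):
    """Unit-length weights of the first PC (assumes sxy != 0)."""
    lam1, _ = eigenvalues_2x2(sxx, sxy, syy)
    vx, vy = sxy, lam1 - sxx
    norm = sqrt(vx * vx + vy * vy)
    return vx / norm, vy / norm


# Hypothetical variables on very different scales.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [100.0, 300.0, 200.0, 500.0, 400.0]

sxx, sxy, syy = covariance_matrix(x, y)
r = sxy / sqrt(sxx * syy)

# PCA on the covariance matrix: dominated by the high-variance variable y.
lam_cov = eigenvalues_2x2(sxx, sxy, syy)
share_cov = lam_cov[0] / sum(lam_cov)
loadings_cov = first_pc_loadings(sxx, sxy, syy)

# PCA on the correlation matrix: both variables are weighted equally.
lam_cor = eigenvalues_2x2(1.0, r, 1.0)
share_cor = lam_cor[0] / sum(lam_cor)
loadings_cor = first_pc_loadings(1.0, r, 1.0)
```

With these toy values the first PC of the covariance matrix is almost identical to y alone, whereas the first PC of the correlation matrix weights x and y equally and explains 90% of the standardised variance.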