Exploring Data and R Flashcards
is a preliminary exploration of the data to better understand its characteristics.
Data Exploration
are numbers that summarize properties of the data.
Summary Statistics
is the percentage of the time a given attribute value occurs in the data set.
Frequency
is the most frequent attribute value.
Mode
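A minimal sketch of frequency and mode in R, assuming a small made-up vector of attribute values (the `colors` data is hypothetical). Note that base R's `mode()` returns an object's storage mode, not the statistical mode, so the most frequent value is taken from a `table()` instead.

```r
# Hypothetical attribute values, made up for illustration
colors <- c("red", "blue", "red", "green", "red", "blue")

# Count each distinct value, then express the counts as percentages
counts <- table(colors)
freq_pct <- 100 * prop.table(counts)

# Statistical mode: the value with the highest count
# (which.max returns the first one if several values tie)
mode_value <- names(counts)[which.max(counts)]

print(freq_pct)
print(mode_value)
```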
2 MEASURES OF LOCATION
- Mean
- Median
is the most common measure of the location of a set of points.
Mean
is a common alternative to the mean, since the mean is very sensitive to outliers.
Median
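To see why the median is preferred when outliers are present, the following sketch compares both measures on a small made-up vector before and after adding one extreme value. The numbers are hypothetical and chosen only for illustration.

```r
# Hypothetical data, with and without a single extreme value
values <- c(2, 3, 4, 5, 6)
with_outlier <- c(values, 100)

# Without the outlier, mean and median agree
mean(values)
median(values)

# With the outlier, the mean is pulled far upward,
# while the median barely moves
mean(with_outlier)
median(with_outlier)
```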
2 WAYS TO MEASURE SPREAD
- Range
- Variance or Standard Deviation
is the difference between the maximum and the minimum value.
Range
is the most common measure of the spread of a set of points.
Variance or Standard Deviation
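A short sketch of the spread measures in R, using a made-up vector `x`. Two details are worth knowing: base R's `range()` returns the minimum and maximum rather than their difference, and `var()` and `sd()` compute the sample versions (dividing by n - 1).

```r
# Hypothetical data for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Range as defined on the card: maximum minus minimum
value_range <- max(x) - min(x)

# range() itself returns c(min, max); diff() turns that into the same number
same_range <- diff(range(x))

# Sample variance and sample standard deviation
variance <- var(x)
std_dev <- sd(x)

print(c(range = value_range, variance = variance, sd = std_dev))
```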
is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
Visualization
12 VISUALIZATION TECHNIQUES / METHODS
- Representation
- Arrangement
- Selection
- Histogram
- Box Plots
- Two Dimensional Histograms
- Scatter Plots
- Contour Plots
- Matrix Plots
- Parallel Coordinates
- Star Plot
- Chernoff Faces
is the mapping of information to a visual format.
Representation
is the placement of visual elements within a display.
Arrangement
is the elimination or de-emphasis of certain objects and attributes.
Selection
usually shows the distribution of values of a single variable.
Histogram
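As a rough illustration, `hist()` in base R bins the values of one variable and counts how many fall into each bin. The data and the bin edges below are made up; `plot = FALSE` is used here only to inspect the counts without drawing.

```r
# Hypothetical values of a single variable
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# Bin the values into intervals of width 2; each bin includes its right edge
h <- hist(x, breaks = c(0, 2, 4, 6, 8, 10), plot = FALSE)

# Number of values that fall into each bin
print(h$counts)

# Calling hist(x, breaks = c(0, 2, 4, 6, 8, 10)) without plot = FALSE
# draws the histogram on the current graphics device
```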
is a simplified version of a probability density function (PDF) or histogram.
Box Plots
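A box plot reduces a distribution to a handful of numbers. The sketch below, on made-up data, uses `boxplot.stats()` to show the values that `boxplot()` draws: the two whisker ends, the two hinges, the median, and any points flagged as outliers. The null PDF device is an assumption made so the example runs without opening a window or writing a file.

```r
# Hypothetical data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Numbers behind the box plot:
# stats = lower whisker, lower hinge, median, upper hinge, upper whisker
# out   = points drawn individually as outliers
b <- boxplot.stats(x)
print(b$stats)
print(b$out)

# Draw the box plot itself (null device: nothing is written to disk)
pdf(NULL)
boxplot(x, horizontal = TRUE)
invisible(dev.off())
```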
shows the joint distribution of the values of two attributes.
Two Dimensional Histograms
attribute values determine the position of each point.
Scatter Plots
are useful when a continuous attribute is measured on a spatial grid. They partition the plane into regions of similar values.
Contour Plots
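One possible illustration uses R's built-in `volcano` matrix, which holds elevation measured on a regular grid. The grid coordinates `x` and `y` and the choice of 8 levels are assumptions made for this sketch.

```r
# Built-in elevation matrix: one continuous attribute on a spatial grid
grid <- volcano
x <- seq_len(nrow(grid))
y <- seq_len(ncol(grid))

# Draw contour lines that separate regions of similar elevation
# (null device: nothing is written to disk)
pdf(NULL)
contour(x, y, grid, nlevels = 8)
invisible(dev.off())

# The same lines as data: each element has a level and its x/y coordinates
level_lines <- contourLines(x, y, grid, nlevels = 8)
print(length(level_lines))
```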
can plot a data matrix.
Matrix Plots
are used to plot the attribute values of high-dimensional data.
Parallel Coordinates
is a similar approach to parallel coordinates, but the axes radiate from a central point.
Star Plot
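Base R provides `stars()` for this kind of plot. The sketch below assumes the built-in `mtcars` data set and an arbitrary subset of six cars and five attributes; each car becomes one star, with one ray per attribute.

```r
# Six objects (cars) described by five attributes, from a built-in data set
cars_subset <- mtcars[1:6, 1:5]

# One star per row; the length of each ray reflects one attribute value
# (null device: nothing is written to disk)
pdf(NULL)
stars(cars_subset, main = "Star plot of six cars")
invisible(dev.off())
```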
is an approach that associates each attribute with a characteristic of a face.
Chernoff Faces
is a language and environment for statistical computing and graphics, within which many classical and modern statistical techniques have been implemented.
R
Who developed R?
Ross Ihaka & Robert Gentleman
is a powerful and productive 3rd party user interface for R.
RStudio IDE
RSTUDIO USER INTERFACE
- Console Pane
- Source Pane
- Environment Pane
- Files Pane
is where you can type and execute commands.
Console Pane
is a text editor or script window where you can edit and save a collection of commands.
Source Pane
contains objects, such as datasets loaded into R, as well as the history of all commands executed.
Environment Pane
is where you can open files, view plots, and install and load packages.
Files Pane
is used for storing data tables. It is a list of vectors of equal length.
Data Frames
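A minimal sketch of the "list of vectors of equal length" idea, using a made-up `students` table (all names and values are hypothetical). Each column is one vector, and every column has the same number of elements.

```r
# Hypothetical data table: three columns, each a vector of length 3
students <- data.frame(
  name   = c("Ana", "Ben", "Cara"),
  age    = c(20, 22, 21),
  passed = c(TRUE, FALSE, TRUE)
)

# A data frame is a list whose elements (columns) all have equal length
is.list(students)
sapply(students, length)

# Columns are accessed like list elements; rows can be filtered by a condition
students$age
students[students$passed, ]
```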
2 PLOTTING COMMANDS
- High-Level Plotting Function
- Low-Level Plotting Function
is a plotting command that creates a new plot on the graphics device.
High-Level Plotting Function
is a plotting command that adds more information to an existing plot, such as extra points, lines, and labels.
Low-Level Plotting Function
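The difference between the two kinds of commands can be sketched with base graphics: `plot()` starts a new plot, while `points()`, `lines()`, and `text()` draw on top of it. The data are made up, and the null PDF device is an assumption so the example runs without opening a window.

```r
# Hypothetical data
x <- 1:10
y <- x^2

# Null device: nothing is written to disk
pdf(NULL)

# High-level: creates a new plot with axes, labels, and a title
plot(x, y, main = "y = x^2", xlab = "x", ylab = "y")

# Low-level: each call adds to the existing plot
points(5, 25, col = "red", pch = 19)   # highlight one point
lines(x, y, lty = 2)                   # connect the points with a dashed line
text(5, 25, labels = "(5, 25)", pos = 4)  # label the highlighted point

invisible(dev.off())
```

Calling a low-level function before any high-level one fails, because there is no plot yet to add to.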
is the most frequently used plotting function.
plot() Function
offers a powerful graphics language for creating elegant and complex plots.
ggplot2 Package
created the ggplot2 package.
Hadley Wickham
is what the ggplot2 package is based on.
Grammar of Graphics
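A small sketch of how the Grammar of Graphics shows up in ggplot2 code: a plot is built from data, aesthetic mappings, and layers combined with `+`. The example assumes the ggplot2 package is installed and uses the built-in `mtcars` data set; the choice of columns is arbitrary.

```r
# Only run if ggplot2 is installed; otherwise leave p empty
p <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +  # data and aesthetic mapping
    geom_point() +                              # layer: draw one point per car
    labs(title = "Fuel economy vs. weight",
         x = "Weight (1000 lbs)", y = "Miles per gallon")

  # print(p) draws the plot on the current graphics device
}
```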