Class 1-12 Flashcards
What are the aims of numerical summaries of discrete variables?
- Aim is to describe the distribution of the variable.
- Question to address: What are the relative frequencies of the different categories? Which categories are common and which are rare?
- Since a categorical variable takes a finite number of possible values, the simplest thing to do is tabulate the number of occurrences of each value.
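The tabulation described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method; the `eye_colors` sample is hypothetical data invented for the example.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (invented for illustration).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

counts = Counter(eye_colors)   # absolute frequency of each category
n = len(eye_colors)
relative = {category: count / n for category, count in counts.items()}

# List categories from most common to rarest.
for category, count in counts.most_common():
    print(f"{category:>6}: {count} ({relative[category]:.0%})")
```

The relative frequencies always sum to one, which makes them easier to compare across samples of different sizes than raw counts.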
What are the aims of numerical summaries of continuous variables?
- Aim is to summarize the data in terms of its distribution.
- It is common to start with some descriptive statistics to get a feeling for the data.
What is the standard deviation?
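• It is a measure of how spread out the numbers are; it is the square root of the variance.
• Variance is the average of the squared differences from the mean (for a sample, the sum is divided by n-1 rather than n).
a) Calculate the mean (the simple average of the numbers).
b) Then, for each number, subtract the mean and square the result (the squared difference).
c) Sum those squared differences and divide by (n-1) to get the sample variance.
d) Take the square root of the variance to get the standard deviation.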
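The calculation steps listed above can be followed literally in Python. This is a small sketch on hypothetical numbers (the `data` list is invented); the result is compared against `statistics.stdev`, which also uses the n-1 divisor.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

mean = sum(data) / len(data)                      # a) the simple average
squared_diffs = [(x - mean) ** 2 for x in data]   # b) squared differences from the mean
variance = sum(squared_diffs) / (len(data) - 1)   # c) sum divided by (n-1): sample variance
std_dev = math.sqrt(variance)                     # square root of the variance

# The standard library gives the same sample standard deviation.
assert math.isclose(std_dev, statistics.stdev(data))
print(f"mean={mean}, variance={variance:.3f}, std_dev={std_dev:.3f}")
```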
What is exploratory data analysis? (EDA)
• It is the process of analyzing and visualizing data to understand it better and glean insight from it.
How Does Exploratory Data Analysis Differ from Summary Analysis?
Summary:
A summary analysis is a numeric reduction of a historical data set.
Quite passive and focused on the past.
Exploratory:
Aims to gain insight into the engineering/scientific process behind the data
Active and futuristic.
What is “variation”?
Variation is the tendency of the values of a variable to change from measurement to measurement.
• Measuring any continuous variable twice will generally give two slightly different results.
• Categorical variables can vary if you measure across different subjects (e.g., eye colors of people), or different times (e.g., the energy levels).
• Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the variable’s values.
What is a “Histogram”?
A histogram is similar to a bar plot. It divides the range of a continuous variable into non-overlapping intervals (binning) and shows how many values fall into each interval.
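Binning can be illustrated without any plotting library. The sketch below assumes equal-width bins, which is a common choice but not the only one; `histogram_counts` is a hypothetical helper written for this card, and `heights` is invented data.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

heights = [150, 152, 158, 161, 163, 165, 166, 170, 174, 180]  # hypothetical, in cm
edges, counts = histogram_counts(heights, n_bins=3)

# Text rendering: one row per bin, one '#' per observation.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {'#' * count}")
```

Changing `n_bins` changes the picture: too few bins hide the shape of the distribution, too many make it noisy.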
What is a “Density Curve”?
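A smoothed representation of a distribution. The y-axis represents probability density, scaled so that the area under the curve equals one; the probability of observing a value in a given interval is the area under the curve over that interval.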
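One subtlety is worth checking numerically: the height of a density curve is a density, so it can exceed 1 while the total area still equals 1. The sketch below uses a Gaussian curve with a small standard deviation as the example density; that choice, and the helper names `normal_pdf` and `area_under`, are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Probability density of a Gaussian distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def area_under(f, a, b, steps=10_000):
    """Trapezoidal approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

def narrow(x):
    """Example density: Gaussian with mean 0 and standard deviation 0.1."""
    return normal_pdf(x, mean=0.0, sd=0.1)

peak = narrow(0.0)                          # height at the center, well above 1
total_area = area_under(narrow, -1.0, 1.0)  # spans 10 standard deviations each side

print(f"peak density = {peak:.3f}, area = {total_area:.6f}")
```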
What is a “Box Plot”?
Graphical representation of the five-number summary
• Depicts quartiles (i.e., the 25%, 50%, and 75% quantiles), minimum, maximum and outliers (if present).
• Conveys the shape of the data distribution, the presence of extreme values, and the ability to compare with other variables using the same scale
• Excellent tool for screening data, determining thresholds for variables and developing working hypotheses.
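The numbers behind a box plot can be computed directly. This is a sketch under two assumptions that the card does not specify: quartiles use the "inclusive" interpolation method of `statistics.quantiles`, and outliers are flagged with the common 1.5 × IQR fence rule. The `data` list is hypothetical.

```python
import statistics

data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 30]  # hypothetical sample

# Quartiles: the 25%, 50%, and 75% quantiles.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, median, q3, max(data))

# Assumed outlier rule: points beyond 1.5 * IQR from the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", five_number)
print("outliers:", outliers)
```

Other quartile conventions give slightly different values on small samples, so plotting libraries may not agree exactly with these numbers.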
What is a Normal Distribution and Why Should You Care?
- Many statistical methods are based on the properties of a normal distribution.
- Applying certain methods to data that are not normally distributed can give misleading or incorrect results.
- Most methods that assume normality are robust enough for all data except those that depart strongly from normality.
What are the attributes of the “Gaussian Distribution”?
• Has the following properties
- Gaussian distributions are symmetric around their mean.
- The mean, median, and mode of a Gaussian distribution are equal.
- The area under the curve is equal to 1.0.
- Gaussian distributions are denser in the center and less dense in the tails.
- Gaussian distributions are defined by two parameters, the mean and the standard deviation.
- Approximately 68% of the area under the curve is within one standard deviation of the mean.
- Approximately 95% of the area of a Gaussian distribution is within two standard deviations of the mean.
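The 68% and 95% figures can be verified from the Gaussian cumulative distribution function, which the standard library exposes through `math.erf`. A short sketch; the helper names `normal_cdf` and `area_within` are chosen for this example.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_within(k):
    """Area under the Gaussian curve within k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)

within_one = area_within(1)  # about 0.6827
within_two = area_within(2)  # about 0.9545

print(f"within 1 sd: {within_one:.4f}, within 2 sd: {within_two:.4f}")
```

Because the result is expressed in standard deviations, it holds for any mean and standard deviation, not only for the standard normal.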
What is a “Scatterplot”?
For a pair of continuous variables, the most common visualization technique is the scatterplot, which maps one variable to the x-axis and the other to the y-axis and draws each observation as a point.
When can we make use of visualization tools?
- visual exploration is the first step when dealing with a new task
- when analyzing models’ performance
- for sharing insights & reporting results
What is the iterative process of EDA?
- generate questions about the data
- search for answers by visualizing, transforming, and modeling the data
- use new knowledge to ask better or new questions
Define “Data Science”
• deals with large volumes of complex data from multiple sources
• aims to develop methods, tools, or services capable of
a. ingesting such data
b. generating semiautomated decision-support systems
What is “Descriptive Analytics”?
- goal: understand the past and present
- tools: summary statistics, correlations, visualizations
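Two of these tools, summary statistics and correlation, can be sketched with the standard library alone. The paired data below (`hours_studied`, `exam_score`) are hypothetical, and `pearson` is a hand-written helper for the Pearson correlation coefficient, assumed here as the correlation measure.

```python
import math
import statistics

# Hypothetical paired observations (invented for illustration).
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 70, 72]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

summary = {
    "mean": statistics.mean(exam_score),
    "median": statistics.median(exam_score),
    "stdev": statistics.stdev(exam_score),
}
r = pearson(hours_studied, exam_score)

print(summary)
print(f"correlation = {r:.3f}")
```

A coefficient close to 1 indicates a strong positive linear association in the observed data; it describes what happened and does not by itself explain why.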