Sect 6.1- exploratory data, initial investigations, spot anomalies, hypotheses, visualization, & features/values Flashcards
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.
Why is exploratory data analysis important in data science?
The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.
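As a small illustration of the kind of summary statistics EDA typically starts with, here is a minimal pandas sketch; the DataFrame, its column names, and its values are invented for the example:
import pandas as pd

# Hypothetical dataset: the column names and values are made up for illustration
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 35, None],
    "income": [42000, 58000, 61000, 47000, 83000, 56000, 45000],
    "segment": ["A", "B", "B", "A", "C", "B", "A"],
})

# Numeric summary: count, mean, standard deviation, min, quartiles, max
print(df.describe())

# Frequency counts for a categorical variable
print(df["segment"].value_counts())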
Exploratory data analysis tools
Specific statistical functions and techniques you can perform with EDA tools include:
Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
Univariate visualization of each field in the raw dataset, with summary statistics.
Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
K-means Clustering is a clustering method in unsupervised learning where data points are assigned to K groups (K being the number of clusters) based on their distance from each group’s centroid. The data points closest to a particular centroid are clustered under the same category. K-means Clustering is commonly used in market segmentation, pattern recognition, and image compression; a minimal sketch appears after this list.
Predictive models, such as linear regression, use statistics and data to predict outcomes.
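As a rough illustration of the K-means idea described above (this is the sketch referenced in that entry), here is a minimal scikit-learn example; the toy blobs and the choice of K = 3 are arbitrary assumptions for illustration:
import numpy as np
from sklearn.cluster import KMeans

# Toy two-dimensional data: three loose blobs (values are arbitrary)
rng = np.random.RandomState(1)
points = np.vstack([
    rng.randn(100, 2) + [0, 0],
    rng.randn(100, 2) + [8, 8],
    rng.randn(100, 2) + [0, 8],
])

# Assign each point to one of K = 3 groups based on distance to the centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(points)

print(kmeans.cluster_centers_)   # coordinates of the 3 centroids
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points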
Types of exploratory data analysis
There are four primary types of EDA:
Univariate non-graphical. This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
Univariate graphical. Non-graphical methods don’t provide a full picture of the data, so graphical methods are also required; a short plotting sketch follows this list of EDA types. Common types of univariate graphics include:
Stem-and-leaf plots, which show all data values and the shape of the distribution.
Histograms, a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
Multivariate nongraphical: Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
Multivariate graphical: Multivariate data uses graphics to display relationships between two or more sets of data. The most commonly used graphic is a grouped bar plot (or bar chart), with each group representing one level of one variable and each bar within a group representing a level of the other variable.
Other common types of multivariate graphics include:
Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how much one variable is affected by another.
Multivariate chart, which is a graphical representation of the relationships between factors and a response.
Run chart, which is a line graph of data plotted over time.
Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
Heat map, which is a graphical representation of data where values are depicted by color.
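To make these plot types concrete, here is the minimal matplotlib sketch referenced above, drawing a histogram, a box plot, and a scatter plot; the toy data is invented for the example:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(1)
x = rng.randn(500) * 20 + 20          # a single numeric variable
y = 0.5 * x + rng.randn(500) * 5      # a second variable loosely related to x

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(x, bins=30)              # univariate: histogram of x
axes[0].set_title("Histogram")

axes[1].boxplot(x)                    # univariate: five-number summary of x
axes[1].set_title("Box plot")

axes[2].scatter(x, y, s=5)            # multivariate: relationship between x and y
axes[2].set_title("Scatter plot")

plt.tight_layout()
plt.show()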
Python: An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components together. Python and EDA can be used together to identify missing values in a data set, which is important so you can decide how to handle missing values for machine learning (a minimal pandas sketch follows below).
R: An open-source programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians in data science in developing statistical observations and data analysis.
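As a minimal sketch of the missing-value check mentioned in the Python entry above, the following pandas snippet counts missing values and shows one possible way to handle them; the DataFrame and its columns are invented for the example:
import pandas as pd
import numpy as np

# Hypothetical dataset with some missing entries (values are invented)
df = pd.DataFrame({
    "age": [23, np.nan, 41, 29, np.nan],
    "income": [42000, 58000, np.nan, 47000, 83000],
})

# Count missing values per column
print(df.isnull().sum())

# One possible handling strategy: fill numeric gaps with the column median
df_filled = df.fillna(df.median(numeric_only=True))
print(df_filled)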
Multivariate statistics
subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., multivariate random variables. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied.
In addition, multivariate statistics is concerned with multivariate probability distributions, in terms of both
how these can be used to represent the distributions of observed data;
how they can be used as part of statistical inference, particularly where several different quantities are of interest to the same analysis.
Certain types of problems involving multivariate data, for example simple linear regression and multiple regression, are not usually considered to be special cases of multivariate statistics because the analysis is dealt with by considering the (univariate) conditional distribution of a single outcome variable given the other variables.
Multivariate analysis (MVA) is based on the principles of multivariate statistics. Typically, MVA is used to address the situations where multiple measurements are made on each experimental unit and the relations among these measurements and their structures are important.[1] A modern, overlapping categorization of MVA includes:[1]
Normal and general multivariate models and distribution theory
The study and measurement of relationships
Probability computations of multidimensional regions
The exploration of data structures and patterns
Multivariate analysis can be complicated by the desire to include physics-based analysis to calculate the effects of variables for a hierarchical “system-of-systems”. Often, studies that wish to use multivariate analysis are stalled by the dimensionality of the problem. These concerns are often eased through the use of surrogate models, highly accurate approximations of the physics-based code. Since surrogate models take the form of an equation, they can be evaluated very quickly. This becomes an enabler for large-scale MVA studies: while a Monte Carlo simulation across the design space is difficult with physics-based codes, it becomes trivial when evaluating surrogate models, which often take the form of response-surface equations.
Univariate
term commonly used in statistics to describe a type of data which consists of observations on only a single characteristic or attribute. A simple example of univariate data would be the salaries of workers in industry.[1] Like other kinds of data, univariate data can be visualized using graphs, images or other analysis tools after the data is measured, collected, reported, and analyzed.
Categorical univariate data
consists of non-numerical observations that may be placed in categories. It includes labels or names used to identify an attribute of each element. Categorical univariate data usually uses either a nominal or an ordinal scale of measurement.[3]
Numerical univariate data
consists of observations that are numbers. They are obtained using either an interval or a ratio scale of measurement. This type of univariate data can be classified even further into two subcategories: discrete and continuous.[4] Numerical univariate data is discrete if the set of all possible values is finite or countably infinite. Discrete univariate data are usually associated with counting (such as the number of books read by a person). Numerical univariate data is continuous if the set of all possible values is an interval of numbers. Continuous univariate data are usually associated with measuring (such as the weights of people).
In statistics, if a data distribution is approximately normal, then about 68% of the data values lie within one standard deviation of the mean, about 95% lie within two standard deviations, and about 99.7% lie within three standard deviations.
Therefore, any data point that lies more than three standard deviations from the mean is very likely to be an anomaly or outlier.
import numpy as np
np.random.seed(1)
# Multiply and add by random numbers to get some realistic values
data = np.random.randn(50000) * 20 + 20
# Function to detect outliers in a one-dimensional dataset
def find_anomalies(data):
    # Define a list to accumulate anomalies
    anomalies = []
    # Set the upper and lower limits at 3 standard deviations from the mean
    data_std = np.std(data)
    data_mean = np.mean(data)
    anomaly_cut_off = data_std * 3
    lower_limit = data_mean - anomaly_cut_off
    upper_limit = data_mean + anomaly_cut_off
    print(lower_limit)
    # Collect every point outside the limits
    for value in data:
        if value > upper_limit or value < lower_limit:
            anomalies.append(value)
    return anomalies
find_anomalies(data)
Box plots are a graphical depiction of numerical data through their quantiles. It is a very simple but effective way to visualize outliers. Think about the lower and upper whiskers as the boundaries of the data distribution. Any data points that show above or below the whiskers can be considered outliers or anomalous. Here is the code to plot a box plot:
import seaborn as sns
import matplotlib.pyplot as plt
# Box plot of the same one-dimensional data generated above
sns.boxplot(data=data)
plt.show()
Boxplot Anatomy:
The concept of the Interquartile Range (IQR) is used to build the boxplot graphs. IQR is a concept in statistics that is used to measure the statistical dispersion and data variability by dividing the dataset into quartiles.
In simple words, any dataset or set of observations is divided into four defined intervals based upon the values of the data and how they compare to the entire dataset. The quartiles are the three points that divide the data into these four intervals.
Interquartile Range (IQR) is important because it is used to define the outliers. It is the difference between the third quartile and the first quartile (IQR = Q3 − Q1). Outliers in this case are defined as observations that fall below (Q1 − 1.5 × IQR), the boxplot's lower whisker, or above (Q3 + 1.5 × IQR), the boxplot's upper whisker.
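Here is a minimal numpy sketch of the IQR rule just described, reusing the same kind of randomly generated data as the earlier example:
import numpy as np

np.random.seed(1)
data = np.random.randn(50000) * 20 + 20

# First and third quartiles and the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Standard box-plot fences: 1.5 * IQR below Q1 and above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(len(outliers), "points fall outside the whiskers")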
DBSCAN is a clustering algorithm that is used to cluster data into groups. It is also used as a density-based anomaly detection method with either single or multi-dimensional data. Other clustering algorithms such as k-means and hierarchical clustering can also be used to detect outliers. In this instance, I will show you an example of using DBSCAN, but before we start, let’s cover some important concepts. DBSCAN has three important concepts:
Core Points: To understand core points, we need to look at the hyperparameters used to define a DBSCAN job. The first hyperparameter (HP) is min_samples, the minimum number of samples required in a point's neighborhood for it to be treated as a core point and seed a cluster. The second important HP is eps, the maximum distance between two samples for them to be considered part of the same neighborhood.
Border Points are in the same cluster as core points but much further away from the centre of the cluster.
Everything else is called a noise point: data points that do not belong to any cluster. They can be anomalous or non-anomalous, and they need further investigation. Now, let’s see some code.
from sklearn.cluster import DBSCAN
import numpy as np
np.random.seed(1)
random_data = np.random.randn(50000, 2) * 20 + 20
# min_samples and eps control how clusters form; points labelled -1 are noise
outlier_detection = DBSCAN(min_samples=2, eps=3)
clusters = outlier_detection.fit_predict(random_data)
# Count the noise points (potential anomalies)
list(clusters).count(-1)
Isolation Forest is an unsupervised learning algorithm that belongs to the ensemble decision trees family. This approach is different from all previous methods: all of the previous ones tried to find the normal region of the data and then identify anything outside of this defined region as an outlier or anomaly.
This method works differently. It explicitly isolates anomalies instead of profiling and constructing normal points and regions by assigning a score to each data point. It takes advantage of the fact that anomalies are the minority data points and that they have attribute-values that are very different from those of normal instances.
This algorithm works well with very high-dimensional datasets and has proved to be a very effective way of detecting anomalies.
from sklearn.ensemble import IsolationForest
import numpy as np
np.random.seed(1)
random_data = np.random.randn(50000, 2) * 20 + 20
# contamination='auto' lets scikit-learn choose the threshold; the old
# behaviour='new' flag has been removed in recent scikit-learn versions
clf = IsolationForest(max_samples=100, random_state=1, contamination='auto')
preds = clf.fit_predict(random_data)   # 1 = inlier, -1 = anomaly
preds
The Random Cut Forest (RCF) algorithm is Amazon’s unsupervised algorithm for detecting anomalies. It also works by associating an anomaly score with each data point. Low score values indicate that the data point is considered “normal,” while high values indicate the presence of an anomaly. The definitions of “low” and “high” depend on the application, but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.
The great thing about this algorithm is that it works with very high dimensional data. It can also work on real-time streaming data (built into AWS Kinesis Analytics) as well as offline data.
The Z-score measures how far a data point is from the mean as a signed multiple of the standard deviation. Large absolute values of the Z-score suggest an anomaly (a short numpy sketch follows the definitions below).
zscore = (x - avg) / stddev
x is the current value of the data point you are evaluating
avg is the mean of the population
stddev is the standard deviation of the population
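Here is the minimal numpy sketch of the Z-score rule referenced above, again on randomly generated data:
import numpy as np

np.random.seed(1)
data = np.random.randn(50000) * 20 + 20

# Z-score of every point: signed distance from the mean in standard deviations
zscores = (data - data.mean()) / data.std()

# Flag points whose absolute Z-score exceeds 3 as candidate anomalies
anomalies = data[np.abs(zscores) > 3]
print(len(anomalies))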
Real-time anomaly detection
Anomaly detection systems are really valuable when you can react to unexpected events as they happen. You need a product that has at least these three features:
Able to handle large production datasets that are updated very frequently, ideally from an event stream (think, for instance, of the volume of HTTP requests coming from a CDN provider).
Customizable: Even when you use an algorithm that can learn on its own, some domain knowledge is sometimes required to support corner cases, avoid false positives, etc.
Easy to integrate: Anomaly detection is just one step of the process; you might then need to integrate with an alerting system, or with visualization software to understand and explore the results.
Caveats
Some caveats when manually building this kind of anomaly detection processes:
Normalize your data: quantities in the same units, dates in the same timezone, etc.
When talking about real-time, make sure everyone understands it as something that happened in the last 10 to 20 seconds.
Avoid false positives as much as possible. There’s nothing worse in an alerting system than being notified of events that don’t require your attention.
Add context to your notifications: a graph snapshot of the time series and the data point that fired the alert, a brief explanation of the reason, the thresholds configured, actionable links or tools for further exploration, etc.
Graphical representation
a way of analysing numerical data. It exhibits the relation between data, ideas, information and concepts in a diagram. It is easy to understand and it is one of the most important learning strategies. It always depends on the type of information in a particular domain. There are different types of graphical representation. Some of them are as follows:
Line Graphs – Line graph or the linear graph is used to display the continuous data and it is useful for predicting future events over time.
Bar Graphs – Bar Graph is used to display the category of data and it compares the data using solid bars to represent the quantities.
Histograms – The graph that uses bars to represent the frequency of numerical data that are organised into intervals. Since all the intervals are equal and continuous, all the bars have the same width.
Line Plot – It shows the frequency of data on a given number line. An ‘x’ is placed above the number line each time that value occurs.
Frequency Table – The table shows the number of pieces of data that fall within a given interval.
Circle Graph – Also known as the pie chart, it shows the relationship of the parts to the whole. The whole circle represents 100%, and each category occupies a sector corresponding to its specific percentage, such as 15%, 56%, etc.
Stem and Leaf Plot – In the stem and leaf plot, the data are organised from least value to greatest value. The digits of the least place value form the leaves and the next place value digits form the stems.
Box and Whisker Plot – The plot summarises the data by dividing it into four parts. The box and whiskers show the range (spread) and the middle (median) of the data.
General Rules for Graphical Representation of Data
There are certain rules to effectively present the information in the graphical representation. They are:
Suitable Title: Make sure that the appropriate title is given to the graph which indicates the subject of the presentation.
Measurement Unit: Mention the measurement unit in the graph.
Proper Scale: To represent the data in an accurate manner, choose a proper scale.
Index: Index the appropriate colours, shades, lines, design in the graphs for better understanding.
Data Sources: Include the source of information wherever it is necessary at the bottom of the graph.
Keep it Simple: Construct a graph in an easy way that everyone can understand.
Neat: Choose the correct size, fonts, colours etc in such a way that the graph should be a visual aid for the presentation of information.
Graphical Representation in Maths
In Mathematics, a graph is defined as a chart of statistical data, represented in the form of curves or lines drawn across the coordinate points plotted on its surface. It helps to study the relationship between two variables by measuring the change in one variable with respect to another within a given interval of time. It helps to study the series distribution and frequency distribution for a given problem. There are two types of graphs to visually depict the information. They are:
Time Series Graphs – Example: Line Graph
Frequency Distribution Graphs – Example: Frequency Polygon Graph
Principles of Graphical Representation
Algebraic principles are applied to all types of graphical representation of data. In graphs, it is represented using two lines called coordinate axes. The horizontal axis is denoted as the x-axis and the vertical axis is denoted as the y-axis. The point at which two lines intersect is called an origin ‘O’.
Consider the x-axis: the distance from the origin to the right side takes a positive value and the distance from the origin to the left side takes a negative value. Similarly, for the y-axis, the points above the origin take a positive value, and the points below the origin take a negative value.
Generally, a frequency distribution is represented using the following methods:
Histogram
Smoothed frequency graph
Pie diagram
Cumulative or ogive frequency graph
Frequency Polygon
Example: frequency polygon graph
Here are the steps to follow to construct a frequency polygon from a frequency distribution and represent it graphically (a short plotting sketch follows these steps):
Obtain the frequency distribution and find the midpoints of each class interval.
Represent the midpoints along the x-axis and the frequencies along the y-axis.
Plot the points corresponding to the frequency at each midpoint.
Join these points, using lines in order.
To complete the polygon, join the point at each end immediately to the lower or higher class marks on the x-axis.
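Here is a minimal matplotlib sketch of the steps above (the one referenced before the list); the class intervals and frequencies are invented for the example:
import numpy as np
import matplotlib.pyplot as plt

# Invented frequency distribution: class interval edges and their frequencies
class_edges = np.array([0, 10, 20, 30, 40, 50])
frequencies = np.array([4, 9, 15, 8, 3])

# Steps 1-2: midpoints of each class interval on the x-axis, frequencies on the y-axis
midpoints = (class_edges[:-1] + class_edges[1:]) / 2

# To complete the polygon, add a zero-frequency class mark at each end
x = np.concatenate(([midpoints[0] - 10], midpoints, [midpoints[-1] + 10]))
y = np.concatenate(([0], frequencies, [0]))

# Steps 3-5: plot the points at each midpoint and join them with straight lines
plt.plot(x, y, marker="o")
plt.xlabel("Class midpoints")
plt.ylabel("Frequency")
plt.title("Frequency polygon")
plt.show()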
Summary Statistics: Measures of location
Measures of location tell you where your data is centered, or where a trend lies. A short computation sketch follows this list.
Mean (also called the arithmetic mean or average).
Geometric mean (used for interest rates and other types of growth).
Trimmed Mean (the mean with outliers excluded).
Median (the middle of a data set).
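Here is the minimal sketch referenced above, computing these measures of location with numpy and scipy; the salary values are invented, with one large outlier to show how the measures differ:
import numpy as np
from scipy import stats

# Toy data with one large value acting as an outlier (values are invented)
salaries = np.array([31, 34, 36, 38, 40, 42, 45, 48, 51, 250])

print(np.mean(salaries))               # arithmetic mean (pulled up by the outlier)
print(stats.gmean(salaries))           # geometric mean (used for growth rates)
print(stats.trim_mean(salaries, 0.1))  # trimmed mean: drops 10% from each tail
print(np.median(salaries))             # median: the middle value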