Exam One Flashcards
What is Analytics
Transforms data into insight for making decisions; analytics “informs” decisions
What do data analysts do?
Collect and interpret data
- Analyze results
- Report results back to the relevant members
- Identify patterns and trends in data sets
- Work alongside teams within the business or the management team to establish business needs
Applications of Business Analytics
- Customer relationship
- Sports game strategies
- Pricing decisions
- Health care
- Human resource planning
- Supply Chain Management
- Finance and marketing
Importance of Business Analytics
- Profitability of businesses
- Revenue of businesses
- Shareholder return
- Enhances understanding of the data
- Vital to remain competitive
- Enables creation of informative reports
Descriptive analytics
- Uses data to understand past and present
- Summarizes data into meaningful charts and reports
- Identify patterns and trends in data
(Pie chart showing sales of product X and Y by region)
Predictive analytics
- Analyzes past performance
- Extrapolating to future
- Predicts risk
(Linear demand Prediction model. As price increases, demand falls line chart)
Prescriptive analytics
- Uses optimization techniques to identify best alternatives
- Often combined with predictive analytics to account for risk
For analysis and Decision making, you need
Metrics to quantify performance
Measures are the values of metrics
Discrete metrics involve counting (on time or not, number of on-time deliveries)
Continuous metrics are measured on a continuum (Delivery time, package weight, purchase price)
Categorical data
Data that helps sort things into groups or types. Doesn’t involve numbers but rather labels or names
Ordinal Data
Involves categories that can be arranged in a specific order or rank, e.g., rating a restaurant experience as “bad”, “okay”, “good”. You know that one is better than another, but not by how much.
Interval data
Has order and measurable differences between values, but no true zero point. An example is temperature in degrees (Celsius or Fahrenheit), where zero does not mean “none”.
Ratio
It has all the features of interval but also has true zero. With ratio you can add, subtract, and use comparisons like “twice as much”.
Good decision making
requires a mixture of skills: creative development and identification of options, clarity of judgment, firmness of decision, effective implementation
Steps to problem solving
- Recognize problem
- Define problem
- Structure the problem
- Analyze the problem (Role of BA)
- Interpreting results and making decisions
- Implement the solution
Recognizing the Problem
A problem exists when there is a gap between what is happening and what we think should be happening
(Distribution costs being too high)
Defining the problem
Clearly defining the problem
ex. High distribution costs stem from:
- Inefficiencies in routing trucks
- Poor location of distribution centers
- External factors such as increasing fuel costs
Structuring the problem
- Stating the goals and objectives (minimizing the total delivered costs of the product)
- Characterizing the possible decisions (new manufacturing plants, new warehouse locations)
- Identifying any constraints or restrictions (Deliver orders within 48 hrs)
Analyzing the Problem
Identifying and applying appropriate BA techniques
Interpreting results and Making Decisions
- Managers interpret results from the analysis phase
- Incorporate subjective judgment as needed
- Understand limitations and model assumptions
- Make a decision utilizing the information
Implementing the solution
- Translate the results of the model back to the real world
- Make solution work in the organization by:
– Providing adequate resources
– Motivating Employees
– Eliminating resistance to change
– Modifying organizational policies
– Developing Trust
Experiment (random)
Process of observation that leads to a single outcome that cannot be predicted with certainty
Sample point
The most basic outcome of a random experiment
Sample Space
Collection of all possible outcomes (Depends on experimenter)
Event
Set of outcomes of a probability experiment
Steps for calculating probability
- Define experiment; describe the process used to make an observation and the type of observation that will be recorded
- List sample points
- Assign probabilities to sample points
- Determine collection of sample points contained in the event of interest
- Sum the sample-point probabilities to get the probability of the event
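The five steps above can be sketched with a simple dice experiment (a minimal Python sketch; the experiment and the event “sum equals 7” are made up for illustration):

```python
from itertools import product
from fractions import Fraction

# Steps 1-2: define the experiment (roll two dice) and list the sample points
sample_space = list(product(range(1, 7), repeat=2))  # 36 outcomes

# Step 3: assign probabilities (all outcomes equally likely here)
p = Fraction(1, len(sample_space))

# Step 4: collect the sample points contained in the event of interest
event = [pt for pt in sample_space if sum(pt) == 7]

# Step 5: sum the sample-point probabilities to get the event probability
prob = p * len(event)
print(prob)  # 1/6
```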
Union
Outcomes in either events A or B or both
- Denoted by ∪: A∪B
- ‘Or’ Statement
Intersection
Outcomes in both events A and B
- ‘AND’ Statement
- Denoted by ∩: A∩B
P(A|B)
P(A|B) = P(A∩B) / P(B), the conditional probability of A given B
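A quick sketch of the formula on a made-up dice example (events A and B are chosen only for illustration):

```python
from fractions import Fraction
from itertools import product

# Roll two fair dice. A = "sum is 8", B = "first die shows an even number".
space = list(product(range(1, 7), repeat=2))
A = {pt for pt in space if sum(pt) == 8}
B = {pt for pt in space if pt[0] % 2 == 0}

p_B = Fraction(len(B), len(space))            # P(B)
p_A_and_B = Fraction(len(A & B), len(space))  # P(A ∩ B)
p_A_given_B = p_A_and_B / p_B                 # P(A|B) = P(A∩B)/P(B)
print(p_A_given_B)  # 1/6
```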
Data preprocessing
- Transforming raw data into an understandable format
- Helps us understand the data and supports knowledge discovery at the same time
Why is data preprocessing needed?
Real-world data tends to be incomplete, noisy, and inconsistent
- leads to poor-quality data and models built on the data
It provides operations that help organize data into a proper form for better understanding in the data mining process
Examples of poor-quality data
Incomplete - Lacking attribute values, lacking certain attributes of interest or containing only aggregate data
Noisy - Contains too many outliers
Intentional - Disguised missing data
Why preprocess data?
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Data Cleaning
- Handling missing data
- Outlier detection and removal
- Noise reduction
Data Transformation
- Scaling
- Smoothing
- Aggregation
- Generalization
Data reduction
- Feature selection
- Dimensionality reduction
- Numerosity reduction
Handling Imbalance
- Oversampling
- Under-sampling
Data Integration
Combining tables
Tasks of data cleaning
- Fill in missing values
- Identify outliers
- Smooth out noisy data
- Correct inconsistent data
Handling missing data
Data is not always available
- many tuples have no recorded value for several attributes
Missing data may be due to
- Equipment malfunction
- Inconsistent with other recorded data thus deleted
- Data not entered due to misunderstanding
- Certain data may not be considered important at the time of entry
- Missing data may need to be inferred
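A minimal sketch of one common fix, mean imputation, assuming missing values are represented as None (the attribute name and numbers are invented):

```python
# Hypothetical attribute with missing entries marked as None
delivery_times = [2.0, 3.5, None, 4.0, None, 2.5]

# Compute the mean over the observed (non-missing) values
observed = [v for v in delivery_times if v is not None]
mean = sum(observed) / len(observed)

# Fill each missing value with the mean of the observed values
filled = [v if v is not None else mean for v in delivery_times]
print(filled)  # [2.0, 3.5, 3.0, 4.0, 3.0, 2.5]
```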
Causes of outliers
- Experimental errors
- Measurement errors (instrument errors)
- Data entry errors (human errors)
- Data processing errors (data manipulation or data set unintended mutations)
- Sampling errors (extracting or mixing data from wrong or various sources)
- Natural (not an error, novelties in data)
Outlier detection and Removal
- Z-score or Extreme Value Analysis
- Interquartile Range Method
- Probabilistic and Statistical modeling
- Linear Regression Models
- Proximity Based Models
- Information Theory Models
- High Dimensional Outlier Detection Methods
Z-score removal
- Very effective when values in the feature fit a Gaussian distribution
- Easy to implement
- It is useful for low-dimensional feature set
- Not recommended when data cannot be assumed to be parametric
- Eliminate data points with a z-score greater than 3 or less than -3
- Under normality, this removes about 0.27% of data points
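A minimal sketch of z-score removal on invented data (the cutoff |z| > 3 follows the rule above):

```python
import statistics

# Hypothetical feature values with one extreme point (100)
data = [9, 10, 11, 10, 9, 11, 10, 12, 10, 9,
        11, 10, 10, 9, 11, 12, 10, 9, 11, 100]

mu = statistics.mean(data)
sigma = statistics.pstdev(data)  # population standard deviation

# Keep only points whose z-score lies within [-3, 3]
kept = [x for x in data if abs((x - mu) / sigma) <= 3]
print(kept)
```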
Noise reduction can be handled by:
Binning :
- First sort data and partition it into (equal-frequency) bins
- Then can smooth by bin means, median, and bin boundaries
Regression:
- Smooth by fitting the data into regression functions
Clustering:
- Detect and remove outliers
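Binning followed by smoothing by bin means can be sketched like this (equal-frequency bins of size 3; the values are invented):

```python
# Equal-frequency binning, then smoothing by bin means
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    # Replace every value in the bin with the bin mean
    smoothed.extend([mean] * len(bin_vals))
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```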
Scaling
Normalization - the process of converting numerical values into a new range using a mathematical function
Two primary reasons why to normalize data
Make two variables in different scales comparable
Some models may need the data to be normalized before modeling
Min-max normalization
v’ = (v-mina)/(maxa - min a)
Smoothing
Statistical technique designed to detect trends in the presence of noisy data, assuming that the trend is smooth
Different types of smoothing
- Bin Smoothing
- Kernels
- Local weighted regression
Generalization
Process of replacing detailed data values with broader, higher-level categories in a database
Ex:
Age groups instead of age
Income levels instead of income
State instead of county
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or nearly the same) analytical results
Why data reduction?
Complex data analysis may take a very long time to run on the complete data set
Additional data does not necessarily mean better results
Feature Selection
Select the most relevant features (or construct new ones from the given features) to make the data mining process more efficient
Dimensionality Reduction
Used to reduce the amount of features
Numerosity Reduction
Replace original data by a smaller form of data representation
- Parametric - Regression models
- Non-parametric - histograms, data sampling and data cube aggregation
Why do feature selection?
- Improved model performance
- Interpretability and Simplicity
- Identification of important features
Feature selection methods:
Pearson Correlation Coefficient:
- Measures linear relationship between two continuous variables
Chi-squared test:
- Used to test if two categorical variables are independent
Analysis of Variance
- Used to compare one categorical and one continuous variable
- Tests if the mean of Variable 1 in different groups of Variable 2 are equal
Data Integration
Combining data from different sources to provide a unified view or dataset
Data imbalance
Uneven distribution of classes in a dataset
Ex.
- Fraud detection data
- Spam classification data
- Medical diagnosis data
Why data imbalance is a problem?
- Bias towards the majority class
- Poor generalization
- Misleading metrics
Popular data-level approaches to handle imbalance:
- Over-sampling minority class
- Under-sampling majority class
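Random over-sampling of the minority class can be sketched as follows (the toy dataset and its 95/5 split are invented):

```python
import random

random.seed(0)

# Imbalanced toy dataset of (feature, label) pairs: label 1 is the minority
data = [(x, 0) for x in range(95)] + [(x, 1) for x in range(5)]

minority = [d for d in data if d[1] == 1]
majority = [d for d in data if d[1] == 0]

# Over-sample the minority class with replacement until the classes balance
oversampled = minority + random.choices(minority, k=len(majority) - len(minority))
balanced = majority + oversampled
print(len(majority), len(oversampled))  # 95 95
```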
Steps in machine learning data preprocessing
1- import libraries
2 - Import data-set
3 - Check out the missing values
4 - See the categorical values
5 - Splitting the data set into Training and Test Set
6 - Feature engineering (scaling, selection, etc.)
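Steps 3 and 5 can be sketched without any external library (the toy rows and the 80/20 split are illustrative choices):

```python
import random

random.seed(42)

# Toy dataset: rows of (feature, label); names and values are illustrative
rows = [(i, i % 2) for i in range(100)]

# Step 3: check for missing values (none in this toy data)
assert all(None not in row for row in rows)

# Step 5: shuffle, then split 80/20 into training and test sets
random.shuffle(rows)
split = int(0.8 * len(rows))
train, test = rows[:split], rows[split:]
print(len(train), len(test))  # 80 20
```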
Population
Set of all items of interest for a particular decision or investigation
Ex.
- All former Texas A&M ID graduates
- All subscribers to Netflix
Sample
Subset of the population
Ex. List of individuals who rented a comedy from Netflix in the past year
Purpose of sampling is to obtain sufficient information to draw a valid inference about a population
Statistics
Any function of the random variables constituting a random sample is called a statistic
Probability density function (PDF)
Statistical expression that defines a probability distribution (the likelihood of an outcome) for a continuous random variable; the discrete analogue is the probability mass function (PMF)
Skewness
Measure of asymmetry of a distribution
Variance
Average of the squared deviations from the mean
Chebyshev’s Theorem
At least 1 - (1/K^2) of any distribution lies within K standard deviations of the mean
1-(1/K^2)
K is any positive number greater than 1
Standard Deviation
Square root of the variance
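A small sketch tying variance, standard deviation, and Chebyshev's bound together (the data values are a made-up example):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # invented small sample

var = statistics.pvariance(data)  # average squared deviation from the mean
sd = statistics.pstdev(data)      # square root of the variance
print(var, sd)                    # 4.0 2.0

# Chebyshev: at least 1 - 1/K^2 of any distribution lies within K std devs
K = 2
bound = 1 - 1 / K**2
print(bound)  # 0.75 -> at least 75% within 2 standard deviations
```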
Correlation
Used to determine when a change in one variable can result in a change in another
Measures of Association
- Both covariance and correlation measure the linear relationship and the dependency between two variables
- Covariance indicates the direction of the linear relationship between variables
- Correlation measures both the strength and direction of the linear relationship between two variables
- Correlation values are standardized
- Covariance values are not standardized
Correlation Coefficient
Tells us how two variables are related. To find it:
- Calculate how the two variables change together (covariance)
- Divide by the product of the two variables' standard deviations (to standardize)
What does the correlation coefficient mean
A standardized number between -1 and 1 that makes it easy to see how strongly the variables are connected. A value close to 1 means a strong positive relationship and a value close to -1 means a strong negative relationship
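The recipe above can be sketched directly (the x and y values are invented):

```python
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Covariance: how the two variables change together (sample, n - 1 divisor)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Standard deviations of each variable (same n - 1 divisor)
sx = (sum((a - mx) ** 2 for a in x) / (n - 1)) ** 0.5
sy = (sum((b - my) ** 2 for b in y) / (n - 1)) ** 0.5

r = cov / (sx * sy)  # correlation coefficient
print(round(r, 4))   # 0.8
```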
Statistical analysis
All about data
Sampling
Foundation of statistical analysis
Sampling Plan
Description of the approach that is used to obtain samples from a population prior to any data collection activity
Sampling plan states
- Its objectives
- Target population
- Population Frame
- Operational procedures for data collection
- Statistical tools for data analysis
Subjective sampling method
Judgment sampling
- Expert judgment is used
Convenience sampling
- collect sample based on convenience
Probability Sampling method
Simple Random Sampling
- selecting items from population so that every subset of given size has equal chance of being selected
Systematic (periodic) sampling (statistical sampling)
- selects every nth item from population
Stratified sampling (statistical sampling)
- Applied to population divided into subsets and allocates an appropriate proportion of samples to each subset
Cluster sampling (statistical sampling)
- Divide population into clusters and sample a set of clusters
Sampling from a continuous process (statistical sampling)
- Fix a time and select the next 'n' items produced after that time, or select 'n' times at random and take the next item produced after each of those times
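Systematic sampling, for instance, can be sketched as follows (a population of 100 items and interval n = 10 are illustrative choices):

```python
import random

random.seed(1)

# Systematic sampling: select every nth item, starting from a random offset
population = list(range(1, 101))  # items 1..100
n = 10                            # sampling interval

start = random.randrange(n)
sample = population[start::n]
print(len(sample))  # 10 items, evenly spaced through the population
```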
Sample Data
Provides basis for many useful analyses to support decision making
Estimation
Assess the value of unknown population parameters (mean, proportion, population variance) using sample data
Unbiased estimator of the population mean μ is the sample mean
If sampling is done randomly and correctly, the sample mean will provide a good estimate of the true population mean
Unbiased estimator of the population variance σ² is the sample variance s²
Using (n-1) instead of n compensates for the fact that we’re estimating based on the sample and accounts for variability in that estimate
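A quick sketch showing the (n - 1) divisor (the sample values are invented; Python's statistics.variance uses the same divisor):

```python
import statistics

sample = [4, 7, 6, 3, 5]  # hypothetical sample

n = len(sample)
xbar = sum(sample) / n
ss = sum((x - xbar) ** 2 for x in sample)  # sum of squared deviations

s2 = ss / (n - 1)  # sample variance: divide by (n - 1), not n
print(s2, statistics.variance(sample))  # 2.5 2.5
```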
Point Estimate
Single number derived from the sample to estimate the population parameter
- If the long-term average of point estimates from population samples equals the true population parameter, the estimator is called an unbiased estimator
Sampling error
Difference between observed values of statistic and the quantity it is intended to estimate
- Any difference between sampling mean and population mean
Causes of Sampling error
Sampling errors: The sample is NOT representative of the population as a whole
Non-sampling errors: Systematic errors such as asking the so-called leading questions during an interview
Central limit theorem
If sample size is large enough, sampling distribution of the mean is:
- Approximately normally distributed regardless of the distribution of the population
- Has a mean equal to the population mean
- If population is normally distributed, then sampling distribution is also normally distributed for any sample size
- Theorem is one of the most practical results in statistics
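A small simulation illustrating the theorem (sample size 50 and 2000 repetitions are arbitrary choices; the population is uniform on [0, 1], clearly non-normal):

```python
import random
import statistics

random.seed(0)

# Draw 2000 samples of size 50 from a uniform population and record each mean
means = [statistics.mean(random.random() for _ in range(50))
         for _ in range(2000)]

# The sample means cluster tightly around the population mean 0.5
print(round(statistics.mean(means), 2))
```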
Sampling distribution of the mean
The distribution of the means of all possible samples of fixed size n from some population
Standard error of the mean
Standard deviation of the sampling distribution of the mean
Confidence intervals
- Provide a range for a population characteristic based on sample
- Provide a way of assessing the accuracy of a point estimate
Level of confidence
1-alpha
Interval Estimates
point estimate ± margin of error
Ex. A Gallup poll reports 56% of voters support a certain candidate with a margin of error of ±3%
- We have a lot of confidence the candidate would win, since the interval is [53%, 59%]
T-distribution
Used for confidence intervals when the population standard deviation is unknown
- Only parameter is degrees of freedom
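A sketch of a t-based confidence interval for the mean (the sample values are invented; the critical value t ≈ 2.262 for df = 9 at 95% confidence is taken from a standard t table):

```python
import statistics

# Hypothetical sample measurements
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.1]

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)   # population sigma unknown, so use s
se = s / n ** 0.5              # standard error of the mean

# t critical value for df = n - 1 = 9 at 95% confidence (from a t table)
t_crit = 2.262
lo, hi = xbar - t_crit * se, xbar + t_crit * se
print(round(lo, 3), round(hi, 3))
```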