CHAPTER 1 Flashcards
Data set
data collected to study information about elements
variable
characteristic of an element
measurement
assigning a value of a variable to the element
Quantitative/numerical
answer how much/how many
qualitative/categorical
record which of several categories an element falls into
cross-sectional data
data collected at the same point in time (e.g., in one month)
time-series data
data collected over different time periods (e.g., 1999-2000)
primary data
- collected by individual/business
- directly through planned experimentation/observation
secondary data
from existing sources (public or private sector)
Steps to start a study
- define variable of interest/response variable
- other variables (factors)
+ can manipulate the value of these factors -> experimental
+ cannot manipulate the value of these factors -> observational
Performing a survey/observation
- ask about behaviors, opinions, beliefs, characteristics
- observe behaviors
data warehousing
process of centralized data management whose goal is to create and maintain a central repository for all of an organization's data
big data
massive amounts of data
often arriving at fast rates in real time and in different forms
population
set of all elements
population of measurements
the set of values obtained by carrying out a measurement that assigns a value of a variable to each and every element of the population
Census
examine all population measurements
sample
subset of the elements of population
sample of measurements
the measurements of a characteristic of the elements in a sample
descriptive stat.
science of describing the important aspects of a set of measurements
stat. inference
science of using a sample of measurements to make generalizations about the important aspects of a population of measurements
random sample
sample selected so that every set of n elements in the population has the same chance of being selected
business analytics
the use of traditional and newly developed statistical methods, advances in information systems, and techniques from management science to continuously and iteratively explore and investigate past business performance, with the purpose of gaining insight and improving business planning and operations.
data mining
the process of discovering useful knowledge in extremely large data sets.
sample with replacement
place the element chosen on any particular selection back into the population => give a chance to be chosen on any succeeding selection
sample without replacement
do not place the element chosen on a particular selection back into the population => it cannot be chosen again => usually best to sample w/o replacement
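A minimal Python sketch of the two selection schemes above. The population of ten numbered elements and the sample size of 5 are made up for illustration; `random.choices` draws with replacement and `random.sample` draws without replacement.

```python
import random

# Hypothetical population: elements numbered 1..10 (not from the source).
population = list(range(1, 11))
rng = random.Random(0)  # fixed seed so the run is repeatable

# With replacement: a chosen element goes back into the population,
# so it may be chosen again on a later selection.
with_replacement = rng.choices(population, k=5)

# Without replacement: a chosen element is not put back,
# so no element can appear twice.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

Only the second list is guaranteed to contain five distinct elements.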
frame
a list of all of the population elements
random number table
a table containing random digits that is often used to select a random sample
Process
a process is a sequence of operations that takes inputs (labor, materials, methods, machines, and so on) and turns them into outputs (products, services, and the like)
finite population
a population that contains a finite number of elements
infinite population
a population that is defined so that there is no limit to the number of elements that could potentially belong to the population
probability sampling
sampling where we know the chance (prob.) that each population element will be included in the sample
convenience sampling
not probability sampling
sampling where we select elements because they are easy or convenient to sample
Voluntary response sample
overrepresent people with strong (usually negative) opinions
a type of convenience sampling
sampling in which the sample participants self-select
judgement sampling
not probability sampling
sampling where an expert selects population elements that he/she feels are representative of the population
dangerous to use the sample to make stat inferences about the population because it depends upon the judgment of the person selecting the sample
improper sampling
unethical
purposely selecting a biased sample
e.g., using a nonrandom sampling procedure that overrepresents population elements supporting a desired conclusion or that underrepresents population elements not supporting the conclusion
misleading charts, graphs, and descriptive measures
unethical
unethical stat practice
inappropriate statistical analysis or inappropriate interpretation of statistical results
e.g., selecting many different samples and running many different tests
until one produces a result that seems to be true but is not
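A rough illustration of why running many tests is risky. The numbers are assumptions, not from the source: each test is taken to have a 5% chance of flagging an effect that is not really there, and the tests are taken to be independent.

```python
# Assumption (not from the source): each test has a 5% chance of
# showing a spurious result, and the tests are independent.
ALPHA = 0.05

def chance_of_false_finding(num_tests: int, alpha: float = ALPHA) -> float:
    """Probability that at least one of num_tests shows a spurious result."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 10, 50):
    print(k, "tests ->", round(chance_of_false_finding(k), 3))
```

Under these assumptions the chance of at least one misleading "finding" grows quickly with the number of tests run.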
descriptive analytics
The use of traditional and more recently developed statistical graphics to present to executives (and sometimes customers) easy-to-understand visual summaries of up-to-the-minute information concerning the operational status of a business.
graphical descriptive analytics
use traditional and/or newer graphics to present to executives (and sometimes customers) easy-to-understand visual summaries of up-to-the-minute info concerning the operational status of a business.
numerical descriptive analytics
association learning, text mining, cluster analysis, and factor analysis.
association learning
identifying items that tend to co-occur and finding the rules that describe their co-occurrence.
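A small sketch of the co-occurrence idea, using made-up shopping baskets. The "confidence" of a rule (the share of baskets containing the first item that also contain the second) is one common way to describe co-occurrence; it is used here as an assumption and is not named in these cards.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, made up for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Count how many baskets contain each single item.
item_counts = Counter(item for basket in baskets for item in basket)

# Rule "bread -> butter": share of bread baskets that also contain butter.
confidence = pair_counts[("bread", "butter")] / item_counts["bread"]
print(pair_counts.most_common(1), confidence)
```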
text mining
The science of **discovering knowledge, insights, and patterns** from a collection of textual documents or databases
using latent semantic analysis
Latent semantic analysis
analyze the relationship between a collection of documents and the words they contain to produce a set of key concepts or factors related to the documents and words
cluster analysis
Finding natural groupings or clusters within data without having to prespecify a set of categories
Factor analysis
Starting with a large number of correlated variables and finding fewer underlying, uncorrelated factors that describe the essential aspects of the large number of correlated variables
reducing large number of variables to fewer underlying factors helps a business focus its activities and strategies
predictive analytics
methods used to find anomalies, patterns, and associations in data sets, with the purpose of predicting future outcomes. The applications of predictive analytics include anomaly (outlier) detection, association learning, classification, cluster detection, prediction and factor analysis
supervised learning technique
methods used to predict values of a response variable on the basis of one or more predictor variables.
classification
assigning items to specified categories or classes
2 classes of predictive analytics
- nonparametric predictive analytics
- parametric
parametric predictive analytics
find a **math equation** that relates the response variable to the predictor variable(s) and involves unknown parameters that must be estimated and evaluated by using sample data
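A minimal sketch of the parametric idea, assuming the simplest possible equation, y = b0 + b1 * x, with the two unknown parameters estimated by least squares. The data values are made up for illustration.

```python
# Hypothetical sample data: x is the predictor, y is the response.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares estimates of the unknown parameters in the assumed
# equation y = b0 + b1 * x (b0 = intercept, b1 = slope).
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
b0 = mean_y - b1 * mean_x

def predict(new_x: float) -> float:
    """Predicted response for a new predictor value."""
    return b0 + b1 * new_x

print(round(b0, 2), round(b1, 2), round(predict(6.0), 2))
```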
parametric predictive analytics include
- classical linear regression
- logistic regression
- discriminant analysis
- neural networks
- time series forecasting
prescriptive analytics
combine external and internal constraints with results from descriptive or predictive analytics to recommend an optimal course of action
Prescriptive analytics include
- decision theory methods
- linear optimization
- nonlinear optimization
- simulation
supervised learning
uses a training set to teach models to yield the desired output
2 types of quantitative variables
ratio and interval
ratio variable
- quantitative variable
- measured on a scale such that ratios of its values are meaningful
- there is an inherently defined zero value
distance of 0 miles = no distance at all
30 miles is twice as far as 15
Interval variable
- quantitative variable
- ratios are not meaningful
- no inherently defined zero value
0 degrees does not mean "no temperature"; it is just cold
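A short check of the ratio/interval distinction. Assumptions for illustration: distance is converted from miles to kilometers and temperature from Fahrenheit to Celsius (the cards do not name a temperature scale). A ratio that is meaningful should not change when the unit changes.

```python
def fahrenheit_to_celsius(f: float) -> float:
    return (f - 32) * 5 / 9

# Distance (ratio variable): the ratio survives a change of units.
KM_PER_MILE = 1.609344
miles_a, miles_b = 30.0, 15.0
ratio_miles = miles_a / miles_b
ratio_km = (miles_a * KM_PER_MILE) / (miles_b * KM_PER_MILE)

# Temperature (interval variable): the ratio depends on the unit,
# because zero is not "no temperature" on either scale.
f_a, f_b = 60.0, 30.0
ratio_f = f_a / f_b
ratio_c = fahrenheit_to_celsius(f_a) / fahrenheit_to_celsius(f_b)

print(ratio_miles, ratio_km)  # same ratio in both units
print(ratio_f, ratio_c)       # different ratios in the two units
```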
2 types of qualitative variable
ordinal and nominative (also called nominal)
ordinal variable
- qualitative
- meaningful ordering/ranking of the categories
good-average-poor/1->5
nominal variable
e.g., gender, color, etc.
- qualitative variable
- no meaningful ordering/ranking
sampling design
methods for obtaining a sample
stratified random sample
divide the pop. into nonoverlapping groups of similar elements (strata)
- random sample is selected from each stratum
- these samples are combined to form the full sample
wise to stratify when the pop. consists of 2 or more groups that differ with respect to the variable of interest (e.g., groups defined by age, gender, ethnic group, or income)
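The stratified procedure above can be sketched as follows. The population, the two age-group strata, and the rule of sampling 10% of each stratum are all assumptions made for illustration.

```python
import random

# Hypothetical population of (person_id, age_group) pairs.
population = [(i, "under_30") for i in range(60)] + [
    (i, "30_and_over") for i in range(60, 100)
]
rng = random.Random(1)  # fixed seed so the run is repeatable

# Step 1: divide the population into nonoverlapping strata.
strata = {}
for element in population:
    strata.setdefault(element[1], []).append(element)

# Step 2: select a random sample from each stratum
#         (here 10% of each stratum, an assumed allocation rule).
# Step 3: combine the stratum samples to form the full sample.
full_sample = []
for name, members in strata.items():
    k = max(1, round(0.10 * len(members)))
    full_sample.extend(rng.sample(members, k))

print(len(full_sample), full_sample[:3])
```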
multistage cluster sampling
- Stage 1: Randomly select a sample of counties from all of the counties in the US
- Stage 2: Randomly select a sample of townships from each county selected in Stage 1
- Stage 3: Randomly select a sample of voting precincts from each township selected in Stage 2
- Stage 4: Randomly select a sample of registered voters from each voting precinct selected in Stage 3
take a sample of registered voters from all registered voters in the US
advantageous when selecting sample from a very large geographical region (a frame doesn’t exist)
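A rough sketch of the staged selection described above. Everything here is hypothetical: `sub_units` stands in for whatever lookup would list the townships in a county, the precincts in a township, and so on, and each unit is assumed to contain four sub-units.

```python
import random

rng = random.Random(2)  # fixed seed so the run is repeatable

def sub_units(unit: str, count: int = 4) -> list[str]:
    """Hypothetical lookup: list the units contained in `unit`."""
    return [f"{unit}-{i}" for i in range(count)]

def multistage_sample(top_units: list[str], stages: int, per_stage: int) -> list[str]:
    # Stage 1: random sample of the top-level units (e.g., counties).
    selected = rng.sample(top_units, per_stage)
    # Later stages: random sample of sub-units within each selected unit.
    for _ in range(stages - 1):
        selected = [
            child
            for unit in selected
            for child in rng.sample(sub_units(unit), per_stage)
        ]
    return selected

counties = [f"county{i}" for i in range(10)]
voters = multistage_sample(counties, stages=4, per_stage=2)
print(len(voters), voters[:3])
```

Sub-units are listed only for the units selected at the previous stage, which is why no frame of the whole population is needed.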
systematic sampling
a sample taken by moving systematically through the population.
- To select a sample of n elements w/o replacement from a frame of N elements: divide N by n and round down to the nearest whole number; call the result l
- Randomly select one element from the first l elements in the frame
- The remaining elements in the sample are obtained by selecting every l-th element following the first element
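The three steps above can be sketched as follows. The frame of 100 numbered elements and the sample size of 10 are made up for illustration; `step` plays the role of l.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Systematic sample of n elements from a frame (a list of all elements)."""
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the frame size")
    step = N // n                 # l = N / n, rounded down
    start = rng.randrange(step)   # random position among the first l elements
    # Take the starting element, then every l-th element after it.
    return [frame[start + i * step] for i in range(n)]

frame = list(range(1, 101))       # hypothetical frame: elements 1..100
sample = systematic_sample(frame, 10, random.Random(4))
print(sample)
```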
types of survey questions
- dichotomous (yes/no)
- MCQ
- open-ended questions
Dichotomous Questions
- clearly stated
- can be answered quickly
- yield data that are easily analyzed
- cons: info may be limited by the two-option format
MCQ
- several different forms
- either categorical or numerical
open-ended questions
- most honest and complete information
- no suggested answers to divert or bias a person’s response
phone survey
- inexpensive
- conducted by callers who have very little training
- impersonal nature -> respondents may misunderstand some of the questions
- some people cannot be reached and others may refuse to answer some or all of the questions
=> low response rate
response rate
the proportion of all people whom we attempt to contact that actually respond to a survey.
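As a quick worked example of this proportion, with made-up counts (the function name is hypothetical):

```python
def response_rate(num_responded: int, num_contacted: int) -> float:
    """Proportion of people we attempted to contact who actually responded."""
    if num_contacted <= 0:
        raise ValueError("num_contacted must be positive")
    return num_responded / num_contacted

# Hypothetical survey: 400 people contacted, 100 responded.
print(response_rate(100, 400))
```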
mail surveys (self-administered surveys)
- inexpensive
- recipients often won’t reply unless they receive some kind of financial incentive or other reward
- the process can take significantly longer than a phone survey
web-based surveys
- same problems as mail surveys
- respondents may record their true reactions incorrectly because they have misunderstood some of the questions posed
personal interview
- more control
- more likely to respond (because of face-to-face)
- questions are less likely to be misunderstood because the people conducting the interviews are typically trained employees who can clear up any confusion
- cons: interviewers can potentially “lead” a respondent by body language + more costly
- e.g., mall survey: about 50% response rate
target population
the entire population of interest to us in a particular study
Sample frame
a **list of sampling elements** (people or things) from which the sample will be selected
(should closely agree with the target population)
Sampling error
The difference between a numerical descriptor of the population and the corresponding descriptor of the sample
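A small sketch of this definition, using the mean as the numerical descriptor. The population values are simulated and the sample size of 30 is arbitrary; taking the difference as sample minus population is an assumed sign convention.

```python
import random
from statistics import mean

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical population of 1,000 measurements (simulated).
population = [rng.gauss(50, 10) for _ in range(1000)]

# Random sample of 30 measurements, selected without replacement.
sample = rng.sample(population, 30)

population_mean = mean(population)   # descriptor of the population
sample_mean = mean(sample)           # corresponding descriptor of the sample
sampling_error = sample_mean - population_mean

print(round(population_mean, 2), round(sample_mean, 2), round(sampling_error, 2))
```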
Two types of sample errors
- errors of nonobservation: related to population elements that are not observed
- errors of observation: occur when the data collected in a survey differ from the truth
Error of coverage
sample frame is different from the target population
- undercoverage: some pop. elements are excluded from the process of selecting the sample
Nonresponse
problem
occurs whenever some of the individuals who were supposed to be included in the sample are not
selection bias
- bias in the results
- related to how survey participants are selected
response bias
- bias in the results
- related to how survey participants answer the survey questions