Final Exam Material Flashcards
what is data sourcing?
(also known as data collection) is the process of extracting data from external or internal sources
data sources include: enterprise databases (historical data, customer sign-up information), web data (web pages, social media), mobile data (apps, GPS), government data, and survey data
why use surveys as a data sourcing model?
efficient way to collect information about a large group of people, flexible medium that can measure attitudes/knowledge/preferences/etc., standardized (so less susceptible to error), easy to administer, can be tailored exactly to the topic you wish to study
keys to effective surveying
begin with clear purpose, know what you want to be able to do with the data ahead of time, identify the most logical group to survey
parts of a survey: title
should reflect the content of the survey, be easy to understand, and be concise
parts of a survey: introduction statement
provides brief summary of survey’s purpose, includes information about the respondent’s confidentiality, motivates the respondent to complete the survey, provides an estimate of the time required to complete, should be clear and concise
parts of a survey: questions
include directions for completing, each question should have a defined objective, pay attention to question wording, lead with high-interest questions, close with demographic questions, and keep it brief by eliminating unnecessary questions
parts of a survey: survey logic
respondent should only be asked questions that apply to them, asking respondents to reply to questions that do not apply to them can lead to confusion and unreliable results (skip and display)
parts of a survey: closing statement
thank the respondent for participating, provide contact information for questions, explain how the survey results will be disseminated, provide relevant information about any incentive offered
double barreled questions
questions that attempt to get at multiple issues at once, and so tend to receive incomplete or confusing answers (ex. do you like pizza and ice cream?)
high-interest questions
should be at beginning of survey, most important
demographic (sensitive) questions
should be at end of survey, not as important but very helpful
question types: open-ended
provides respondents the opportunity to express themselves in their own words, no correct answers, often elicit unanticipated responses which provide new directions for research, can be difficult to interpret/analyze if clear themes do not emerge, short answer text or essay format
question types: closed-ended
more difficult to write than open-ended questions, have a finite set of answers, responses are easy to standardize and analyze statistically, may miss pertinent information if a key answer is not provided to respondents (can be corrected by using “other” response option)
advantages and disadvantages of open-ended questions
advantages:
respondents can define central issues, address the issue of “why”
disadvantages:
can be time consuming, results can be more challenging to analyze, leading questions can lead to less reliable results
advantages and disadvantages of closed-ended questions
advantages:
easy to answer, easier to analyze results
disadvantages:
cannot address the issue of “why,” limited options available to respondents, can be hard to gauge results (ex. a 2 on a rating scale can mean different things to different respondents)
types of survey logic (skip vs display)
skip logic: allows you to send respondents to a future point in the survey based on how they answer a question. (ex. if a respondent indicates that they don’t fit your respondent criteria, they could immediately be skipped to the end of the survey.)
display logic: allows you to display questions conditionally based on the respondent’s answers to previous questions.
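The two kinds of logic can be sketched in a few lines of Python. This is a minimal illustration, not how any particular survey tool implements it; the questions, ids, and the `display_if` / `skip_to_end_if` rule names are all hypothetical.

```python
# Hypothetical four-question survey; ids, text, and rule names are illustrative only.
QUESTIONS = [
    {"id": "q1", "text": "Do you own a car?"},
    {"id": "q2", "text": "What brand is your car?",
     # display logic: shown only when q1 was answered "yes"
     "display_if": ("q1", "yes")},
    {"id": "q3", "text": "Are you 18 or older?",
     # skip logic: answering "no" sends the respondent to the end
     "skip_to_end_if": "no"},
    {"id": "q4", "text": "How satisfied are you with public transit?"},
]

def run_survey(answer_for):
    """Walk the survey; answer_for maps a question id to a canned answer."""
    answers = {}
    for q in QUESTIONS:
        cond = q.get("display_if")
        if cond is not None and answers.get(cond[0]) != cond[1]:
            continue  # display logic: the question does not apply, so hide it
        answers[q["id"]] = answer_for[q["id"]]
        if answers[q["id"]] == q.get("skip_to_end_if"):
            break  # skip logic: jump past all remaining questions
    return answers
```

A respondent who answers "no" to q1 never sees q2, and a respondent who answers "no" to q3 never sees q4.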
survey administration: population vs sample
population: the larger set of individuals you wish to understand
sample: a subset selected from a population to survey
sampling techniques: simple random sample
members of the subset are chosen completely at random so that every member of the population has an equal probability of being selected
sampling techniques: stratified sample
the population is divided up into relatively homogeneous groups; then, a proportionate probability sample is drawn from each group
sampling techniques: convenience sample
members of the subset are selected according to their availability
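The three sampling techniques can be compared side by side with Python's standard `random` module. This is a rough sketch; the population of 100 students, the `level` field, and the function names are made up for illustration.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, seed=0):
    """Every member has an equal chance of being selected."""
    return random.Random(seed).sample(population, n)

def stratified_sample(population, key, fraction, seed=0):
    """Group members by key, then draw the same fraction at random from each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in population:
        strata[key(member)].append(member)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # proportionate to group size
        sample.extend(rng.sample(group, k))
    return sample

def convenience_sample(population, n):
    """Take whoever is easiest to reach; here, simply the first n members."""
    return population[:n]

# Hypothetical population: 60 undergraduates and 40 graduate students.
people = [{"id": i, "level": "undergrad" if i < 60 else "grad"} for i in range(100)]
```

With `fraction=0.1`, the stratified sample contains 6 undergraduates and 4 graduate students, mirroring the 60/40 split in the population. The convenience sample, by contrast, contains only undergraduates because they happen to come first.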
survey analysis: reporting the results
a final report should include: purpose, design of survey, administration process, data analysis, and findings
primary data
data collected from the original source by the investigator himself/herself for a specific purpose
secondary data
data collected by someone else for some other purpose (but being utilized by the investigator for another purpose) or not from the original source
advantages and disadvantages of primary data
advantages:
data collected is specific to the problem, quality of data can be ensured, may be possible to obtain additional data
disadvantages:
expensive, time consuming, requires setup and manpower
advantages and disadvantages of secondary data
advantages:
cost-effective, quicker to gather
disadvantages:
you cannot decide what is collected (may be out of date or inaccurate), no control over quality, hard to obtain additional data
robots.txt file
A text file that provides special instructions (e.g., which pages should not be crawled) about a Web site to Web crawlers.
Web site owners use the robots.txt file to give instructions to web robots (e.g., scrapers) about their site
The file is structured to specify what parts of the site robots are DISALLOWED to examine
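A scraper can check these rules before requesting a page. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the example.com URLs, and the agent name `my-scraper` are hypothetical, and the file is parsed from a string so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file is served from the site's root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="my-scraper"):
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)
```

Here `allowed("https://example.com/about")` is true, while any URL under `/private/` is refused.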
API
Application Programming Interface
intermediary software that allows two applications to talk to each other. Through a Web API, a sourcing application can talk to a website (i.e., extract information from the website). Many websites require a developer account to access their Web API
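As a rough sketch of what talking to a Web API looks like, the Python below builds an authenticated request and parses a JSON reply. The endpoint `api.example.com`, the `q` and `limit` parameters, the bearer-token header, and the response shape are all assumptions for illustration; a real API documents its own.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; a real one comes from the provider's documentation.
BASE_URL = "https://api.example.com/v1/posts"

def build_request(query, api_key, limit=10):
    """Build (but do not send) an authenticated GET request."""
    url = f"{BASE_URL}?{urlencode({'q': query, 'limit': limit})}"
    # The key issued with a developer account is assumed to go in this header.
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})

def parse_response(body):
    """Extract the text of each item from a JSON body (assumed shape)."""
    return [item["text"] for item in json.loads(body)["items"]]

# Sending would use urllib.request.urlopen(build_request(...)); it is omitted
# here so the sketch runs without network access.
sample_body = '{"items": [{"text": "hello"}, {"text": "world"}]}'
```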
transactional information
encompasses all of the information contained within a single business process or unit of work, and its primary purpose is to support the performing of daily operational tasks
analytical information
encompasses all organizational information, and its primary purpose is to support performing of managerial analysis tasks
examples of transactional information
airline ticket, sales receipt, packing slip
examples of analytical information
product statistics, sales projections, future growth, trends
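The relationship between the two can be illustrated by rolling transactional records up into analytical ones. A minimal sketch in Python, with made-up receipts and field names:

```python
from collections import defaultdict

# Hypothetical transactional records: one row per sales receipt line.
receipts = [
    {"receipt": 1, "product": "widget", "qty": 2, "price": 5.0},
    {"receipt": 2, "product": "gadget", "qty": 1, "price": 20.0},
    {"receipt": 3, "product": "widget", "qty": 3, "price": 5.0},
]

def product_statistics(rows):
    """Roll individual transactions up into per-product totals."""
    stats = defaultdict(lambda: {"units": 0, "revenue": 0.0})
    for row in rows:
        stats[row["product"]]["units"] += row["qty"]
        stats[row["product"]]["revenue"] += row["qty"] * row["price"]
    return dict(stats)
```

Each receipt supports a single operational task, while the per-product totals are the kind of summary a manager would analyze.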
data quality
data that are fit for use by data consumers and satisfy the requirements of their intended use
(depends on what is needed to know)
high-quality data
data that are relevant and accurately represent their corresponding concepts
high-quality information
information that is relevant and a faithful representation of what is being reported
characteristics and examples of high-quality information
accurate: is there an incorrect value in the information? (name spelled correctly? is the dollar amount recorded properly?)
complete: is a value missing from the information? (is the address complete?)
consistent: is aggregate or summary information in agreement with detailed information? (do all total columns equal the true total of the individual items?)
timely: is the information current with respect to business needs? (is information updated weekly, daily, or hourly?)
unique: is each transaction and event represented only once in the information? (are there any duplicate customers?)
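The five characteristics can be turned into simple automated checks. The Python sketch below is illustrative only: the customer records, the five-digit ZIP rule, and the one-year freshness threshold are assumptions, and real rules depend on the business need.

```python
import math
from datetime import date

# Hypothetical customer records, for illustration only.
customers = [
    {"id": 1, "name": "Ana Ruiz", "zip": "30301", "updated": date(2024, 5, 1)},
    {"id": 2, "name": "Bo Chen", "zip": "", "updated": date(2024, 5, 2)},
    {"id": 2, "name": "Bo Chen", "zip": "", "updated": date(2024, 5, 2)},
    {"id": 3, "name": "Cy Dole", "zip": "3030", "updated": date(2020, 1, 1)},
]

def quality_report(rows, today, max_age_days=365):
    """Flag the ids of records that fail each check."""
    ids = [r["id"] for r in rows]
    return {
        # accurate: a ZIP code, when present, is assumed to be five digits
        "inaccurate": [r["id"] for r in rows
                       if r["zip"] and not (len(r["zip"]) == 5 and r["zip"].isdigit())],
        # complete: the ZIP code must not be missing
        "incomplete": [r["id"] for r in rows if not r["zip"]],
        # timely: the record must have been updated recently enough
        "stale": [r["id"] for r in rows if (today - r["updated"]).days > max_age_days],
        # unique: each id should appear only once
        "duplicated": sorted({i for i in ids if ids.count(i) > 1}),
    }

def totals_consistent(line_amounts, reported_total):
    """consistent: the summary total must agree with the detail lines."""
    return math.isclose(sum(line_amounts), reported_total)

report = quality_report(customers, today=date(2024, 6, 1))
```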
benefits of high-quality information
Information is everywhere in an organization
Employees must be able to obtain and analyze the many different levels, formats, and granularities of organizational information to make decisions
Successfully collecting, compiling, sorting, and analyzing information can provide tremendous insight into how an organization is performing
examples of low information quality
missing information, incomplete information, probable duplicate information, potentially wrong information, inaccurate information
sources of low-quality information
primary sources:
- customers intentionally enter inaccurate information to protect their privacy
- different entry standards and formats
- operators enter abbreviated or erroneous information by accident or to save time
- third party and external information contains inconsistencies, inaccuracies, and errors
- parallel data entry (duplicates)
costs/consequences of low-quality information
potential business effects resulting from low-quality information include:
- inability to accurately track customers
- difficulty identifying valuable customers
- inability to identify selling opportunities
- marketing to nonexistent customers
- difficulty tracking revenue
- inability to build strong customer relationships
why is data cleaning important?
improves your data quality and, in doing so, increases overall productivity. When you clean your data, outdated or incorrect information is removed or corrected, leaving you with higher-quality information
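A minimal cleaning pass might look like the Python sketch below. The records and the rules (email as the unique key, title-cased names) are assumptions for illustration; `str.title()` in particular is a crude normalizer that mishandles some names.

```python
def clean(records):
    """Normalize formats, drop rows missing an email, and remove duplicates."""
    seen, cleaned = set(), []
    for rec in records:
        email = rec.get("email", "").strip().lower()  # one standard format
        if not email:
            continue  # incomplete record: no way to contact the customer
        if email in seen:
            continue  # duplicate of a record already kept
        seen.add(email)
        name = " ".join(rec["name"].split()).title()  # collapse stray spaces
        cleaned.append({"name": name, "email": email})
    return cleaned

# Hypothetical raw entries with inconsistent formatting.
raw = [
    {"name": "  ana   ruiz", "email": "Ana@Example.com "},
    {"name": "Ana Ruiz", "email": "ana@example.com"},
    {"name": "Bo Chen", "email": ""},
]
```

Of the three raw entries, only one cleaned record for Ana Ruiz remains: the second is a duplicate once formats are standardized, and the third is missing its email.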