Final Flashcards
There are basic chart types and specialized chart types. A Gantt chart is a specialized chart type.
True
This measure of central tendency is the sum of all the values/observations divided by the number of observations in the data set.
arithmetic mean
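As a small illustration of the definition above, the sketch below computes the arithmetic mean by hand and checks it against Python's standard library. The sample values are made up for the example.

```python
from statistics import mean

# Hypothetical sample of five observations.
values = [4, 8, 15, 16, 23]

# Arithmetic mean = sum of all observations / number of observations.
manual_mean = sum(values) / len(values)

# The standard library gives the same result.
library_mean = mean(values)
```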
Subject oriented databases for data warehousing are organized by detailed subjects such as disk drives, computers, and networks.
False
Two-tier data warehouse/BI infrastructures offer organizations more flexibility but cost more than three-tier ones.
False
A(n) ________ architecture is used to build a scalable and maintainable infrastructure that includes a centralized data warehouse and several dependent data marts.
Hub-and-spoke
Which broad area of data mining applications analyzes data, forming rules to distinguish between defined classes?
Classification
Converting continuous valued numerical variables to ranges and categories is referred to as discretization.
TRUE
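A minimal sketch of discretization, assuming a hypothetical age variable and illustrative cut points (18 and 65) that do not come from the source:

```python
def discretize(age):
    """Map a continuous age to a labeled range (cut points are illustrative)."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

# Hypothetical continuous values.
ages = [7.5, 18.0, 42.3, 64.9, 65.0, 80.1]
labels = [discretize(a) for a in ages]
```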
Data source reliability means that data are correct and are a good match for the analytics problem.
False
The competitive imperatives for BI include all of the following except
Right user
In the 2000s, the DW-driven DSSs began to be called BI systems.
TRUE
How are enterprise resources planning (ERP) systems related to supply chain management (SCM) systems?
Complementary systems
OLTP systems are designed to handle ad hoc analysis and complex queries that deal with many data items.
FALSE
Information dashboards enable ________ operations that allow the users to view underlying data sources and obtain more detail.
drill-down/drill-through
In the opening case, police detectives used data mining to identify possible new areas of inquiry.
False
Clustering partitions a collection of things into segments whose members share
Similar Characteristics
Ratio data is a type of categorical data.
False
Which of the following is a data mining myth?
Data mining requires a separate, dedicated database.
If using a mining analogy, “knowledge mining” would be a more appropriate term than “data mining.”
True
The cost of data storage has plummeted recently, making data mining feasible for more firms.
True
In the Miami-Dade Police Department case study, predictive analytics helped to identify the best schedule for officers in order to pay the least overtime.
False
All of the following statements about data mining are true EXCEPT
the process aspect means that data mining should be a one-step process to results.
Using data mining on data about imports and exports can help to detect tax avoidance and money laundering.
true
K-fold cross-validation is also called sliding estimation.
False
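k-fold cross-validation is usually called rotation estimation: the data are cut into k folds, and each fold takes one turn as the test set while the rest train the model. A rough sketch of the index bookkeeping follows; the function name and the choice of contiguous, unshuffled folds are assumptions made for illustration.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each record is tested exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

# Ten records, three folds.
folds = list(k_fold_indices(10, 3))
```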
Big Data often involves a form of distributed storage and processing using Hadoop and MapReduce. One reason for this is
the processing power needed for the centralized model would overload a single computer.
In the Opening Vignette on Sports Analytics, what type of modeling was used to predict offensive tactics?
Heat Maps
What type of analytics seeks to recognize what is going on as well as the likely forecast and make decisions to achieve the best performance possible?
Prescriptive
Demands for instant, on-demand access to dispersed information decrease as firms successfully integrate BI into their operations.
False
The use of dashboards and data visualizations is seldom effective in identifying issues in organizations, as demonstrated by the Silvaris Corporation Case Study.
false
Today, many vendors offer diversified tools, some of which are completely preprogrammed (called shells). How are these shells utilized?
All a user needs to do is insert the numbers.
The growth in hardware, software, and network capacities has had little impact on modern BI innovations.
False
Information systems that support such transactions as ATM withdrawals, bank deposits, and cash register scans at the grocery store represent transaction processing, a critical branch of BI.
False
Because of performance and data quality issues, most experts agree that the federated architecture should supplement data warehouses, not replace them.
True
Ratio data is a type of categorical data
FALSE
The use of dashboards and data visualizations is seldom effective in identifying issues in organizations, as demonstrated by the Silvaris Corporation Case Study.
False
Market basket
False
In text mining, if an association between two concepts has 7% support, it means that 7% of the documents had both concepts represented in the same document.
True
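To make the support figure concrete, the sketch below counts the share of documents in which two concepts appear together. The documents and concept names are invented for the example.

```python
# Each hypothetical document is represented by the set of concepts found in it.
documents = [
    {"merger", "lawsuit"},
    {"merger", "earnings"},
    {"lawsuit", "merger", "earnings"},
    {"earnings"},
    {"lawsuit"},
]

def support(docs, concept_a, concept_b):
    """Fraction of documents that contain both concepts."""
    both = sum(1 for d in docs if concept_a in d and concept_b in d)
    return both / len(docs)

# Two of the five documents mention both concepts, so support is 40%.
pair_support = support(documents, "merger", "lawsuit")
```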
________ is a segmentation metric for social networks that measures the strength of the bonds between actors in a social network
Cohesion
What has caused the growth of the demand for instant, on-demand access to dispersed information?
the more pressing need to close the gap between the operational data and strategic objectives
The need for more versatile reporting than what was available in 1980s era ERP systems led to the development of what type of system?
executive information systems
What storage system and processing algorithm were developed by Google for Big Data?
* Google developed the Google File System (GFS) for storing large amounts of data in a distributed way. Its open-source counterpart, released as an Apache project, is the Hadoop Distributed File System (HDFS).
* Google developed the MapReduce algorithm for pushing computation to the data, instead of pushing data to a computing node. Apache Hadoop provides an open-source implementation of it.
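The MapReduce idea can be sketched on a single machine as a word count. This is only a toy illustration: the function names are hypothetical, and a real framework would run the map and reduce steps on the nodes that hold each data chunk.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Two hypothetical chunks that would live on different nodes.
chunks = ["big data big", "data lake"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
```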
Describe the role of the simple split in estimating the accuracy of classification models.
The simple split (or holdout or test sample estimation) partitions the data into two mutually exclusive subsets called a training set and a test set (or holdout set). It is common to designate two-thirds of the data as the training set and the remaining one-third as the test set. The training set is used by the inducer (model builder), and the built classifier is then tested on the test set. An exception to this rule occurs when the classifier is an artificial neural network. In this case, the data is partitioned into three mutually exclusive subsets: training, validation, and testing.
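The two-thirds / one-third holdout described above could look like the following sketch. The function name, the fixed random seed, and the use of integers as stand-in records are assumptions for illustration.

```python
import random

def simple_split(records, train_fraction=2 / 3, seed=42):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Thirty hypothetical records, represented here by their IDs.
data = list(range(30))
train, test = simple_split(data)
```

The model builder would see only `train`; accuracy would then be estimated on `test`, which shares no records with the training set.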
Data is the contextualization of information, that is, information set in context
False
This measure of dispersion is calculated by simply taking the square root of the variance.
standard deviation
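A short worked example of the standard deviation as the square root of the variance. The values are made up, and the sketch uses the population formula (dividing by n); a sample standard deviation would divide by n - 1 instead.

```python
import math

# Hypothetical observations.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = sum(values) / len(values)

# Population variance: average squared deviation from the mean.
variance = sum((x - mu) ** 2 for x in values) / len(values)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)
```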
Nominal data represent the labels of multiple classes used to divide a variable into specific groups.
False
In the Dallas Cowboys case study, the focus was on using data analytics to decide which players would play every week.
False
This plot is a graphical illustration of several descriptive statistics about a given data set
Box and whisker plot
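The descriptive statistics a box-and-whisker plot typically draws are the minimum, first quartile, median, third quartile, and maximum. A sketch of computing them follows; the data are invented, and the "inclusive" quartile method is one of several conventions, so other tools may report slightly different quartiles.

```python
from statistics import quantiles

# Hypothetical data set.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Quartiles using the "inclusive" convention (an assumption of this sketch).
q1, median, q3 = quantiles(values, n=4, method="inclusive")

five_number_summary = (min(values), q1, median, q3, max(values))
```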
Which type of visualization tool can be very helpful when the intention is to show relative proportions of dollars per department allocated by a university administration?
Pie chart
Which type of visualization tool can be very helpful when a data set contains location data?
Geographic map
The data storage component of a business reporting system builds the various reports and hosts them for, or disseminates them to users. It also provides notification, annotation, collaboration, and other services.
False
One way an operational data store differs from a data warehouse is the recency of their data.
True
Properly integrating data from various databases and other disparate sources is a trivial process.
False
What is Six Sigma?
a methodology aimed at reducing the number of defects in a business process
A Web client that connects to a Web server, which is in turn connected to a BI application server, is reflective of a
three-tier architecture
The data warehousing maturity model consists of six stages: prenatal, infant, child, teenager, adult, and sage.
True
When representing data in a data warehouse, using several dimension tables that are each connected only to a fact table means you are using which warehouse structure?
Star schema
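A tiny star schema can be sketched with Python's built-in sqlite3 module: one fact table in the middle, with each dimension table joined only to it. All table names, column names, and rows below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );

    INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Monitor');
    INSERT INTO dim_store   VALUES (1, 'Austin'), (2, 'Tulsa');
    INSERT INTO fact_sales  VALUES (1, 1, 900.0), (1, 2, 950.0), (2, 1, 200.0);
""")

# Total sales per product: join the fact table to one dimension.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
```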
User-initiated navigation of data through disaggregation is referred to as “drill up.”
False
Data warehouses are subsets of data marts.
False
The BPM development cycle is essentially a one-shot process where the requirement is to get it right the first time.
False
Operational or transaction databases are product oriented, handling transactions that update the database. In contrast, data warehouses are
Subject-oriented and nonvolatile
What type of analytics seeks to determine what is likely to happen in the future?
Predictive
Online transaction processing (OLTP) systems handle a company’s routine ongoing business. In contrast, a data warehouse is typically
a distinct system that provides storage for data that will be made use of in analysis.
In the Opening Vignette on Sports Analytics, what was adjusted to drive one-time ticket sales?
Ticket prices
Successful BI is a tool for the information systems department, but is not exposed to the larger organization.
False
Business intelligence (BI) is a specific term that describes architectures and tools only.
False
Managing information on operations, customers, internal procedures and employee interactions is the domain of cognitive science.
False
The user interface of a BI system is often referred to as a(n) ________.
Dashboard
As the number of potential BI applications increases, the need to justify and prioritize them arises. This is not an easy task due to the large number of ________ benefits.
Intangible
________ series forecasting is the use of mathematical modeling to predict future values of the variable of interest based on previously observed values.
Time
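One very simple form of time series forecasting is a moving average, which predicts the next value from the mean of the most recent observations. The sketch below is illustrative only: the series and window size are made up, and practical forecasting usually relies on richer models.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

# Hypothetical previously observed values.
history = [100, 104, 110, 108, 115]
next_value = moving_average_forecast(history)
```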
Dashboards present visual displays of important information that are consolidated and arranged on a single ________.
Screen
Descriptive statistics is all about describing the sample data on hand.
True
Which characteristic of data requires that the variables and data values be defined at the lowest (or as low as required) level of detail for the intended use of the data?
data granularity
In the FEMA case study, the BureauNet software was the primary reason behind the increased speed and relevance of the reports FEMA employees received.
True
Dashboards provide visual displays of important information that is consolidated and arranged across several screens to maintain data order.
False
Which characteristic of data means that all the required data elements are included in the data set?
Data richness
Data source reliability means that data are correct and are a good match for the analytics problem.
False
With the balanced scorecard approach, the entire focus is on measuring and managing specific financial goals based on the organization’s strategy.
False
Moving the data into a data warehouse is usually the easiest part of its creation.
False
Data warehouse administrators (DWAs) do not need strong business insight since they only handle the technical aspect of the infrastructure.
False
Because the recession has raised interest in low-cost open source software, it is now set to replace traditional enterprise software.
False
The three main types of data warehouses are data marts, operational ________, and enterprise data warehouses.
Data stores
Data mining can be very useful in detecting patterns such as credit card fraud, but is of little help in improving sales.
False
In the Influence Health case study, what was the goal of the system?
increasing service use
Understanding customers better has helped Amazon and others become more successful. The understanding comes primarily from
analyzing the vast data amounts routinely collected.
Statistics and data mining both look for data sets that are as large as possible.
False
The data field “ethnic group” can be best described as
nominal data.
In the Target case study, why did Target send maternity ads to a teen?
Target’s analytic model suggested she was pregnant based on her buying habits.
One way to accomplish privacy and protection of individuals’ rights when data mining is by ________ of the customer records prior to applying data mining applications, so that the records cannot be traced to an individual.
de-identification
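A rough sketch of de-identification before mining: direct identifiers are dropped from each record. The field names are hypothetical, and removing direct identifiers alone may not be enough, since combinations of the remaining fields can sometimes still point to an individual.

```python
# Fields treated as direct identifiers (an assumption of this sketch).
IDENTIFIER_FIELDS = {"name", "ssn"}

def de_identify(record):
    """Return a copy of the record without direct identifier fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}

# Hypothetical customer record.
customer = {"name": "Pat Doe", "ssn": "000-00-0000", "age": 34, "basket_total": 57.20}
anonymous = de_identify(customer)
```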
Patterns have been manually ________ from data by humans for centuries, but the increasing volume of data in modern times has created a need for more automatic approaches.
Extracted
In the Influence Health case, the company was able to evaluate over ________ million records in only two days.
195
What are the most important assumptions in linear regression?
1. Linearity. This assumption states that the relationship between the response variable and the explanatory variables is linear. That is, the expected value of the response variable is a straight-line function of each explanatory variable, while holding all other explanatory variables fixed. Also, the slope of the line does not depend on the values of the other variables. It also implies that the effects of different explanatory variables on the expected value of the response variable are additive in nature.
2. Independence (of errors). This assumption states that the errors of the response variable are uncorrelated with each other. This independence of the errors is weaker than actual statistical independence, which is a stronger condition and is often not needed for linear regression analysis.
3. Normality (of errors). This assumption states that the errors of the response variable are normally distributed. That is, they are supposed to be totally random and should not represent any nonrandom patterns.
4. Constant variance (of errors). This assumption, also called homoscedasticity, states that the response variables have the same variance in their error, regardless of the values of the explanatory variables. In practice this assumption is invalid if the response variable varies over a wide enough range/scale.
5. Multicollinearity. This assumption states that the explanatory variables are not correlated (i.e., do not replicate the same but provide a different perspective of the information needed for the model). Multicollinearity can be triggered by having two or more perfectly correlated explanatory variables presented to the model (e.g., if the same explanatory variable is mistakenly included in the model twice, one with a slight transformation of the same variable). A correlation-based data assessment usually catches this error.
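The assumptions above concern the errors (residuals) of a fitted line. As a minimal sketch, the code below fits a one-variable least-squares line and computes the residuals one would then inspect for patterns, non-normality, or changing variance. The data points are invented.

```python
# Hypothetical explanatory (x) and response (y) values.
x = [1, 2, 3, 4]
y = [2, 4, 5, 7]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Ordinary least squares for one explanatory variable.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residuals: observed minus fitted values.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
```

With an intercept in the model, the residuals sum to zero by construction, so checking the assumptions means looking at their pattern and spread, not their total.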
With ________, all the data from every corner of the enterprise is collected and integrated into a consistent schema so that every part of the organization has access to the single version of the truth when and where needed.
Enterprise Resource Planning (ERP)
Briefly describe five techniques (or algorithms) that are used for classification modeling.
* Decision tree analysis. Decision tree analysis (a machine-learning technique) is arguably the most popular classification technique in the data mining arena.
* Statistical analysis. Statistical techniques were the primary classification algorithm for many years until the emergence of machine-learning techniques. Statistical classification techniques include logistic regression and discriminant analysis.
* Neural networks. These are among the most popular machine-learning techniques that can be used for classification-type problems.
* Case-based reasoning. This approach uses historical cases to recognize commonalities in order to assign a new case into the most probable category.
* Bayesian classifiers. This approach uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into a most probable class (or category).
* Genetic algorithms. This approach uses the analogy of natural evolution to build directed-search-based mechanisms to classify data samples.
* Rough sets. This method takes into account the partial membership of class labels to predefined categories in building models (collection of rules) for classification problems.
Six Sigma rests on a simple performance improvement model known as DMAIC. What are the steps involved?
1. Define. Define the goals, objectives, and boundaries of the improvement activity. At the top level, the goals are the strategic objectives of the company. At lower levels (department or project levels) the goals are focused on specific operational processes.
2. Measure. Measure the existing system. Establish quantitative measures that will yield statistically valid data. The data can be used to monitor progress toward the goals defined in the previous step.
3. Analyze. Analyze the system to identify ways to eliminate the gap between the current performance of the system or process and the desired goal.
4. Improve. Initiate actions to eliminate the gap by finding ways to do things better, cheaper, or faster. Use project management and other planning tools to implement the new approach.
5. Control. Institutionalize the improved system by modifying compensation and incentive systems, policies, procedures, manufacturing resource planning, budgets, operation instructions, or other management systems.
Many business users in the 1980s referred to their mainframes as “the black hole,” because all the information went into it, but little ever came back and ad hoc real-time querying was virtually impossible.
True
Computerized support is only used for organizational decisions that are responses to external pressures, not for taking advantage of opportunities.
False
Data generation is a precursor, and is not included in the analytics ecosystem.
False
In what decade did disjointed information systems begin to be integrated?
1980s
Major commercial business intelligence (BI) products and services were well established in the early 1970s.
False
BI represents a bold new paradigm in which the company’s business strategy must be aligned to its business intelligence analysis initiatives.
False
Kaplan and Norton developed a report that presents an integrated view of success in the organization called
balanced scorecard-type reports.
Interval data are variables that can be measured on interval scales.
True
Predictive algorithms generally require a flat file with a target variable, so making data analytics ready for prediction means that data sets must be transformed into a flat-file format and made ready for ingestion into those predictive algorithms.
True
Data accessibility means that the data are easily and readily obtainable
True
This measure of central tendency is the sum of all the values/observations divided by the number of observations in the data set
arithmetic mean
Structured data is what data mining algorithms use and can be classified as categorical or numeric.
True
Key performance indicators (KPIs) are metrics typically used to measure
Internal results
Visual analytics is aimed at answering, “What is happening?” and is usually associated with business analytics.
False
Oper marts are created when operational data needs to be analyzed
multidimensionally.
Which of the following BEST enables a data warehouse to handle complex queries and scale up to handle many more requests?
parallel processing
When querying a dimensional database, a user went from summarized data to its underlying details. The function that served this purpose is
Drill down
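Drill down can be pictured as re-aggregating the same rows at a finer level. The sketch below starts from a regional summary and drills into one region's cities; the data and the `summarize` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, city, sales amount).
sales = [
    ("West", "Seattle", 120),
    ("West", "Portland", 80),
    ("East", "Boston", 150),
    ("East", "Albany", 50),
]

def summarize(rows, level):
    """Total sales at the given level: 0 = region, 1 = city."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row[2]
    return dict(totals)

# Summarized view.
by_region = summarize(sales, 0)

# Drill down: from the "West" total to its underlying city-level details.
west_rows = [row for row in sales if row[0] == "West"]
west_by_city = summarize(west_rows, 1)
```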
_______ is an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.
Enterprise information integration (EII)
Which data warehouse architecture uses a normalized relational warehouse that feeds multiple data marts?
hub-and-spoke data warehouse architecture
All of the following are benefits of hosted data warehouses EXCEPT
greater control of data.
Why is a performance management system superior to a performance measurement system?
because measurement alone has little use without action
Which data mining process/methodology is thought to be the most comprehensive, according to kdnuggets.com rankings?
CRISP-DM
In estimating the accuracy of data mining (or other) classification models, the true positive rate is
the ratio of correctly classified positives divided by the total positive count.
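As a worked example of the ratio above, the sketch below counts true positives and false negatives from hypothetical actual and predicted labels (1 = positive, 0 = negative) and derives the true positive rate.

```python
# Hypothetical labels for ten instances.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
true_positives  = sum(1 for a, p in pairs if a == 1 and p == 1)
false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)
false_positives = sum(1 for a, p in pairs if a == 0 and p == 1)

# True positive rate = correctly classified positives / total positive count.
true_positive_rate = true_positives / (true_positives + false_negatives)
```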
What is the main reason parallel processing is sometimes used for data mining?
because of the massive data amounts and search efforts involved
Identifying and preventing incorrect claim payments and fraudulent activities falls under which type of data mining applications?
Insurance
________ is an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.
Enterprise information integration (EII)
Which data warehouse architecture uses a normalized relational warehouse that feeds multiple data marts?
hub-and-spoke data warehouse architecture
Data warehouses provide indirect benefits to organizations. Which of the following is an indirect benefit of data warehouses?
improved customer service
All of the following are true about in-database processing technology except
The potentially useful aspect means that the results should lead to some business benefit
The Data warehousing maturity model consists of six stages: prenatal, infant, child, teenager, adult, and sage
TRUE
List 4 possible analytics applications in the retail value chain
Inventory, Price Elasticity, Shopper Insight, Store Layout
In the Dell case study, the largest issue was how to properly spend the online marketing budget.
FALSE
The entire focus of the predictive analytics system in the Infinity P&C case was on detecting and handling fraudulent claims for the company’s benefit.
FALSE
Using data mining on data about imports and exports can help to detect tax avoidance and money laundering
TRUE
Understanding customers better has helped Amazon and others become more successful. The understanding comes primarily from
analyzing the vast data amounts routinely collected.
Which of the following is a data mining myth?
Data mining requires a separate, dedicated database.
Nominal data represent the labels of multiple classes used to divide a variable into specific groups
False
Which type of question does visual analytics seek to answer?
Why did it happen?
To respond to its market challenges, Sirius XM decided to focus on manufacturing efficiency.
False
Data is the main ingredient for any BI, data science, and business analytics initiative.
False
Google Maps has set new standards for data visualization with its intuitive Web mapping software.
TRUE
Dashboards provide visual displays of important information that is consolidated and arranged across several screens to maintain data order
False
Traditional BI systems use a large volume of static data that has been extracted, cleaned, and loaded into a data warehouse to produce reports and analyses.
TRUE
Big Data often involves a form of distributed storage and processing using Hadoop and MapReduce. One reason for this is
the processing power needed for the centralized model would overload a single computer.
Which of the following is NOT an example of transaction processing?
Sales report
Data generation is a precursor, and is not included in the analytics ecosystem
FALSE
What type of analytics seeks to determine what is likely to happen in the future?
Predictive
If using a mining analogy, “knowledge mining” would be a more appropriate term than “data mining.”
TRUE
Data mining can be very useful in detecting patterns such as credit card fraud, but is of little help in improving sales
FALSE
In data mining, classification models help in prediction.
TRUE
Structured data is what data mining algorithms use and can be classified as categorical or numeric
TRUE
Which of the following is LEAST related to data/information visualization?
Statistical graphics
Visualization differs from traditional charts and graphs in complexity of data sets and use of multiple dimensions and measures.
TRUE
Dashboards can be presented at all the following levels EXCEPT
The visual cube level
Descriptive statistics is about describing the sample data on hand
TRUE
Business applications have moved from transaction processing and monitoring to other activities. Which of the following is NOT one of those activities?
Data monitoring
Managing data warehouses requires special methods, including parallel computing and/or Hadoop/Spark
TRUE
The very design that makes an OLTP system efficient for transaction processing makes it inefficient for
end-user ad hoc reports, queries, and analysis.
Real-time data warehousing can be used to support the highest level of decision making sophistication and power. The major feature that enables this in relation to handling the data is
speed of data transfer.
Data warehouse administrators (DWAs) do not need strong business insight since they only handle the technical aspect of the infrastructure.
FALSE
Data warehouses are subsets of data marts
FALSE
Subject oriented databases for data warehousing are organized by detailed subjects such as disk drives, computers, and networks
FALSE
Organizations seldom devote a lot of effort to creating metadata because it is not important for the effective use of data warehouses.
FALSE
Which approach to data warehouse integration focuses more on sharing process functionality than data across systems?
Enterprise application integration
Which kind of data warehouse is created separately from the enterprise data warehouse by a department and not reliant on it for updates?
Independent data mart
A large storage location that can hold vast quantities of data (mostly unstructured) in its native/raw format for future/potential analytics consumption is referred to as a(n)
Data Lake
The “islands of data” problem in the 1980s describes the phenomenon of unconnected data being stored in numerous locations within an organization.
True
Which of the following developments is NOT contributing to facilitating growth of decision support and analytics?
Locally concentrated workforces
During classification in data mining, a false positive is an instance classified as true by the model while being false in reality.
TRUE
In data mining, finding an affinity of two products to be commonly together in a shopping cart is known as
association rule mining
All of the following statements about data mining are true EXCEPT:
The ideas behind it are relatively new
Third party providers of publicly available data sets protect the anonymity of the individuals in the data set primarily by
removing identifiers such as names and social security numbers.
Which data mining process/methodology is thought to be the most comprehensive, according to kdnuggets.com rankings?
CRISP-DM
Contextual metadata for a dashboard includes all the following EXCEPT
which operating system is running the dashboard server software.
What is the management feature of a dashboard?
Operational data that identify what actions to take to resolve a problem
Benefits of the latest visual analytics tools, such as SAS Visual Analytics, include all of the following EXCEPT
they explore massive amounts of data in hours, not days.
When you tell a story in a presentation, all of the following are true EXCEPT
a well-told story should have no need for subsequent discussion.
Relational databases began to be used in the:
1980s
Decision support system (DSS) and management information system (MIS) have precise definitions agreed to by practitioners.
FALSE
Computer applications have moved from transaction processing
FALSE
Describe and define Big Data. Why is a search engine a Big Data application?
Big Data is data that cannot be stored in a single storage unit. It refers to data that arrives in multiple forms (structured, unstructured, or streaming). A search engine is a Big Data application because, in response to a user's query on a topic or question, it searches the Web and returns billions of relevant Web pages in a fraction of a second.
There are several basic information system architectures that can be used for data warehousing. What are they?
Some IS architectures that can be used for data warehousing are one-, two-, and three-tier architectures.
List 5 reasons for the growing popularity of data mining in the business world
Recognize fraud
Identifies risk factors
Can improve customer relationships
Advances in both computer hardware and software
More accessible and affordable
List the five most common functions of a business report
To ensure that all departments are functioning properly
To provide information
To provide the results of an analysis
To persuade others to act
To create an organizational memory (as part of a knowledge management system)
More data, coming in faster and requiring immediate conversion into decisions, means that organizations are confronting the need for RDW. What is RDW?
Real-time data warehousing (RDW), also known as active data warehousing (ADW), is the process of loading and providing data via the data warehouse as they become available.
Which of the following is an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies?
BI
Describe the difference between descriptive and inferential statistics
Descriptive statistics describe the data set on hand. Inferential statistics draw conclusions about a population based on a sample of the data.
A common way of introducing data warehousing is to refer to its fundamental characteristics. Describe three characteristics of data warehousing.
Subject oriented: Data is organized by detailed subject, such as sales, products, or customers, containing data relevant for decision support.
Integrated: Must place data from different sources into a consistent format. To do so, they have to deal with various conflicts.
Nonvolatile: After the data is entered into the data warehouse, users cannot change the data or update it. Changes are recorded as new data.
In lessons learned from the Target case, what legal warning would you give another reseller using data mining for marketing?
If you look at the case you can see that Target didn’t violate any law. Target didn’t use any information that violates customer privacy. They only used transactional data that every other retail store obtains and stores. In terms of legal matters they didn’t do anything wrong.
In the Tito’s Vodka case study, trends in cocktails were studied to create a quarterly recipe for customers.
True
Search engine optimization (SEO) is a means by which
Web site developers can increase Web site search rankings
In the Wimbledon case study, designers balanced the needs of mobile and desktop computer users.
True
What types of documents are BEST suited to semantic labeling and aggregation to determine sentiment orientation?
small- to medium-sized documents
Search engine optimization (SEO) techniques play a minor role in a Web site’s search ranking because only well-written content matters.
False
Text analytics is the subset of text mining that handles information retrieval and extraction, plus data mining.
False
In the car insurance case study, text mining was used to identify auto features that caused injuries
False
________ is a connections metric for social networks that measures the ties that actors in a network have with others that are geographically close.
Propinquity
________ Web analytics refers to measurement and analysis of data relating to your company that takes place outside your Web site.
Off-site
Categorization and clustering of documents during text mining differ only in the preselection of categories.
True
Web-based media has nearly identical cost and scale structures as traditional media.
false
Web site usability may be rated poor if
Web site visitors download few of your offered PDFs and videos.
Companies understand that when their product goes “viral,” the content of the online conversations about their product does not matter, only the volume of conversations.
False
In text mining, tokenizing is the process of
categorizing a block of text in a sentence
Clickstream analysis does not need users to enter their perceptions of the Web site or other feedback directly to be useful in determining their preferences.
True
IBM’s Watson utilizes a massively parallel, text mining-focused, probabilistic evidence-based computational architecture called
DeepQA