CHAPTER EIGHT. THE ARCHITECTURE OF ANALYTICS AND BIG DATA: ALIGNING A ROBUST TECHNICAL ENVIRONMENT WITH BUSINESS STRATEGIES. Flashcards
What has become technically and economically feasible over the last decade?
Capturing and storing huge quantities of data
What are the data volume tiers mentioned?
- Megabytes
- Gigabytes
- Terabytes
- Petabytes
What percentage of all data is estimated to be analyzed?
0.5 percent
What challenges do most IT departments face regarding data?
They strain to meet minimal service demands and must invest resources in support and maintenance
What is a common issue organizations face when integrating data into analytical applications?
Data cleansing
What is the role of IT departments in analytics?
Manage information technology for analytics and other applications
What critical task must organizations determine for analytical architecture?
How to encourage insightful answers and prevent uncontrolled proliferation of ‘versions of the truth’
What is necessary for determining technical capabilities for analytical competition?
Close collaboration between IT organizations and business managers
What should guiding principles for technology investments reflect?
Corporate priorities
What is the job of the IT architect or chief data officer?
To ensure the right data, technology, and processes for analytics across the enterprise
What are the stages of analytical competition?
- Stage 1: Poor-quality data and poorly integrated systems
- Stage 2: Efficient transaction data collection but lacking the right data
- Stage 3: Proliferation of BI tools but non-standard data
- Stage 4: High-quality data with an enterprise-wide analytical plan
- Stage 5: Full-fledged analytics architecture with integrated big and small data
What does the analytics and big data architecture encompass?
Processes and technologies for collecting, structuring, managing, and reporting decision-oriented data
What are the six elements of the analytics and big data architecture?
- Data management
- Transformation tools and processes
- Repositories
- Analytical tools and applications
- Data visualization tools and applications
- Deployment processes
What is the goal of a well-designed data management strategy?
To ensure the organization has the right information and uses it appropriately
What is a major challenge companies face regarding data?
Dirty data: inconsistent, fragmented, and out of context information
What questions must IT and business experts tackle to achieve analytical competition?
- Data relevance
- Data sourcing
- Data quantity
- Data quality
- Data governance
What does data relevance pertain to?
What data is needed to compete on analytics
What is the significance of having access to the right data?
It is crucial for competitive differentiation and business performance
What problem arises when IT and business managers do not collaborate closely?
Blame over the wrong data being collected or the right data being unavailable
What companies have improved cooperation between quantitative analysts and business leaders?
- Intel
- Procter & Gamble
What do IT executives believe about business managers regarding data needs?
They believe business managers do not understand what data they need
This reflects a gap in communication and understanding between IT and business sides.
What do surveys of business managers reveal about IT executives?
Business managers believe IT executives lack the business acumen to make meaningful data available
This indicates a need for better collaboration between IT and business leaders.
What is essential for organizations to compete analytically?
Cooperation between business leaders and IT managers
Without this cooperation, data gathering for competitive analysis is severely limited.
What role do quantitative analysts play in companies like Intel and Procter & Gamble?
They work closely alongside business leaders
This collaboration helps bridge the gap between data analysis and business needs.
What is crucial for defining relationships among data used in analysis?
Considerable business expertise is required
This expertise helps IT understand potential relationships in the data.
Identify the types of customers an insurance company may have.
- Corporate customers
- Individual subscribers
- Members of subscribers’ families
Each type of customer has unique medical histories and relationships with service providers.
What is necessary for data to be useful for analytics?
Insight into the nature of relationships among the data
Without this insight, data usefulness is extremely limited.
Where does data for analytics originate?
It originates from many places and needs to be managed through an enterprise-wide infrastructure
This ensures that data is streamlined, consistent, and scalable.
What is the importance of having common applications and data across the enterprise?
It helps yield a ‘consistent version of the truth’
This is essential for everyone involved in analytics.
What are enterprise systems?
Integrated software applications that automate, connect, and manage information flows for business processes
They are critical for providing consistent data for tasks like financial reporting.
What is edge analytics?
A paradigm where data is analyzed at the source rather than being sent to a centralized repository
This approach is becoming more common due to the growth of IoT devices.
What is a challenge with collecting data from IoT devices?
It is often unfeasible to send all data to a central repository for analysis
Real-time analytics at the edge can optimize operations.
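A minimal sketch of the edge analytics idea, assuming a hypothetical device that holds a handful of temperature readings: the readings are summarized at the source, and only the small summary would be sent on to a central repository. The function name, field names, and threshold are illustrative assumptions, not part of the chapter.

```python
from statistics import mean

def summarize_at_edge(readings, threshold):
    """Reduce raw sensor readings to a small summary on the device itself."""
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
        # Number of readings above the (hypothetical) alert threshold.
        "alerts": sum(1 for r in readings if r > threshold),
    }

# Hypothetical temperature readings captured on one device.
raw_readings = [70.1, 70.4, 69.8, 85.2, 70.0]
summary = summarize_at_edge(raw_readings, threshold=80.0)
print(summary)  # only this summary, not the raw readings, would leave the device
```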
What are some sources of external data?
- Internet
- Social media
- External data providers
- Government information
- Company websites
These sources provide valuable data for analytics.
What is a potential legal issue with data collection?
Sensitive customer information may be illegal to capture
Organizations must navigate legal constraints when collecting data.
What is the significance of Progressive’s Snapshot program?
It offers discounts for customers who allow data collection about their driving behavior
This helps in accurate pricing and understanding risk.
How much data was Walmart’s data warehouse in 2007?
About 600 terabytes
This was the largest data warehouse at that time.
What is the current trend in data volume management?
Hadoop clusters storing data across multiple commodity servers
This technology allows for large-scale data management.
What two pitfalls should companies avoid in data collection?
- Collecting all possible data ‘just in case’
- Collecting data that is easy to capture but not important
Both can lead to data overload and inefficiency.
What are some characteristics that increase the value of data?
- Correctness
- Completeness
- Currency
- Consistency
- Context
- Control
- Analysis
These attributes ensure data is actionable and valuable.
What is the first step in the data management life cycle?
Data acquisition
This involves determining what data is needed and how to integrate IT systems.
What does data cleansing involve?
Detecting and removing out-of-date, incorrect, incomplete, or redundant data
This is critical for ensuring data quality.
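A minimal sketch of data cleansing, assuming hypothetical customer records keyed by email: it drops incomplete, out-of-date, and redundant records. Detecting incorrect values usually needs domain-specific checks and is not attempted here. The field names and cutoff date are assumptions for illustration.

```python
from datetime import date

def cleanse(records, cutoff):
    """Drop incomplete, out-of-date, and redundant records."""
    seen = set()
    clean = []
    for rec in records:
        if not rec.get("email"):        # incomplete: required field missing
            continue
        if rec["updated"] < cutoff:     # out of date: older than the cutoff
            continue
        key = rec["email"].strip().lower()
        if key in seen:                 # redundant: same email already kept
            continue
        seen.add(key)
        clean.append(rec)
    return clean

# Hypothetical records; the second duplicates the first after normalization.
records = [
    {"email": "ann@example.com", "updated": date(2024, 5, 1)},
    {"email": "ANN@example.com ", "updated": date(2024, 6, 1)},
    {"email": "", "updated": date(2024, 6, 1)},
    {"email": "bob@example.com", "updated": date(2019, 1, 1)},
]
cleaned = cleanse(records, cutoff=date(2023, 1, 1))
```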
What is the purpose of data organization and storage?
To systematically extract, integrate, and synthesize data for use
This ensures data is ready for analysis.
What is ETL in data management?
Extract, Transform, Load
A traditional process for making data usable in a data warehouse.
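A minimal sketch of the three ETL steps, assuming a hypothetical CSV extract and an in-memory SQLite database standing in for the warehouse. The table name, column names, and the rule of skipping rows with a non-numeric amount are illustrative assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; one row has an unusable amount.
SOURCE_CSV = """customer_id,name,amount
1, Ann ,10.50
2,Bob,n/a
3,Cy,7
"""

def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim names, convert types, skip rows that fail conversion."""
    out = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        out.append((int(row["customer_id"]), row["name"].strip(), amount))
    return out

def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer_id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```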
What is the role of transformation tools in data management?
They clean and validate data to make it decision-ready
This is necessary for accurate analytics.
What is the role of transformation procedures in data management?
Transformation procedures define the business logic that maps data from its source to its destination.
Why is significant manual effort required in data transformation?
Both business and IT managers must expend significant effort to transform data into usable information.
What is the estimated labor cost for data integration according to Sohaib Abbasi?
For every dollar spent on integration technology, around seven to eight dollars is spent on labor for manual data coding.
What is the challenge of defining business concepts like ‘customer’ in data transformation?
A ‘customer’ may be defined as a company in one system but as an individual in another, leading to inconsistent definitions.
What can be done with missing data during transformation?
Missing data can sometimes be filled using inferred data or projections, or it may simply remain missing.
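A minimal sketch of filling missing data with an inferred value, assuming the simplest possible inference (the mean of the known values). Real projects may use more careful projections, and the variable names here are hypothetical.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        # Nothing to infer from, so the data simply remains missing.
        return list(values)
    estimate = mean(known)
    return [estimate if v is None else v for v in values]

# Hypothetical monthly spend with two gaps.
monthly_spend = [100.0, None, 140.0, None]
filled = fill_missing(monthly_spend)
```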
What role do automated machine learning systems play in data standardization?
They help identify likely overlaps and redundancies in data.
What is a data warehouse?
A data warehouse is a database that contains integrated data from different sources and is regularly updated.
What is the purpose of a data mart?
Data marts support a single business function or process and usually contain predetermined analyses.
What is contained in a metadata repository?
It contains technical information, data definitions, source information, and instructions on how the data should be applied.
What are the advantages of open-source distributed data frameworks like Hadoop?
They allow storage of data in any format at lower costs than traditional warehouses but may require higher technical expertise.
What is a data lake?
A data lake stores data in its original format and structures it as it is accessed for analysis.
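A minimal sketch of structuring data only when it is accessed, assuming a hypothetical "lake" that is just a list of raw strings kept in their original form. The reader applies a structure (user and click count) at query time and skips anything that does not fit.

```python
import json

# Hypothetical raw items stored as-is, with no schema enforced on write.
raw_lake = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": "5", "device": "mobile"}',
    'free-text note that is not JSON',
]

def read_clicks(lake):
    """Apply structure at read time: yield (user, clicks) where possible."""
    for blob in lake:
        try:
            doc = json.loads(blob)
        except json.JSONDecodeError:
            continue  # not JSON; other analyses might still use it
        if isinstance(doc, dict) and "clicks" in doc:
            yield doc["user"], int(doc["clicks"])

clicks = dict(read_clicks(raw_lake))
```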
What factors influence the choice of analytical tools?
Factors include how thoroughly decision making should be embedded in business processes and whether to use third-party applications or custom solutions.
What is the ROI of implementing packaged analytical applications according to IDC?
The median ROI is 140 percent.
What are some major players in the analytical software market?
Major players include SAS, IBM, and SAP, along with open-source tools such as R and RapidMiner.
What is the primary use of spreadsheets in analytics?
Spreadsheets are used for the ‘last mile’ of analytics before data presentation.
What are OLAP tools used for?
OLAP tools are used for semistructured decisions and analyses on relational data.
What is a key characteristic of data visualization tools like Tableau?
They operate on entire datasets rather than just data cubes.
What do statistical algorithms enable managers to do?
They enable managers to analyze data to arrive at optimal targets such as prices or loan amounts.
How do rule engines function?
Rule engines process conditional statements to address logical questions.
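A minimal sketch of a rule engine, assuming a hypothetical loan-application example: each rule pairs a condition with an action, and the first rule whose condition holds decides the outcome. The thresholds and action names are invented for illustration.

```python
# Each rule is (condition, action); rules are checked in order.
RULES = [
    (lambda app: app["credit_score"] < 600, "decline"),
    (lambda app: app["amount"] > 50_000, "manual_review"),
    (lambda app: True, "approve"),  # default rule when nothing else matches
]

def decide(application, rules=RULES):
    """Return the action of the first rule whose condition is true."""
    for condition, action in rules:
        if condition(application):
            return action
    return None

decision = decide({"credit_score": 700, "amount": 60_000})
```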
What technologies have superseded rule engines in popularity?
Machine learning and cognitive technologies.
What is the objective of data mining tools?
To identify patterns in complex and ill-defined data sets.
What can text mining tools help managers identify?
Emerging trends in near-real time.
What is text categorization?
Using statistical models or rules to rate a document’s relevance to a certain topic.
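A minimal sketch of rule-based text categorization, assuming a hypothetical keyword list for a "fraud" topic: the relevance score is the share of words in the document that appear on the list. Statistical models would replace this simple rule in practice.

```python
import re

# Hypothetical keywords that signal the topic of interest.
TOPIC_TERMS = {"fraud", "chargeback", "stolen", "unauthorized"}

def relevance(document, terms=TOPIC_TERMS):
    """Score a document from 0.0 to 1.0 by the share of topic keywords."""
    tokens = re.findall(r"[a-z]+", document.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in terms)
    return hits / len(tokens)

score = relevance("Customer reports unauthorized charge on stolen card")
```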
What is the purpose of natural language processing tools?
To make sense of language and answer human questions.
What is event streaming used for?
To analyze data as it comes in from applications like the Internet of Things.
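A minimal sketch of analyzing events as they arrive, assuming a hypothetical stream of numeric sensor values: a rolling average is updated for each new event rather than waiting for the full data set. The window size is an arbitrary choice.

```python
from collections import deque

def rolling_average(events, window=3):
    """Yield the average of the most recent `window` values after each event."""
    recent = deque(maxlen=window)  # older values drop off automatically
    for value in events:
        recent.append(value)
        yield sum(recent) / len(recent)

# Hypothetical stream; in practice this would be an unbounded source.
averages = list(rolling_average([10, 20, 30, 40], window=3))
```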
What do simulation tools model?
Business processes using mathematical and scientific functions.
What is the main focus of web or digital analytics?
Managing and analyzing online and e-commerce data.
What is A/B testing in web analytics?
Statistical comparisons of which version of a website gets more clicks.
What is web analytics?
A category of analytical tools for managing and analyzing online and e-commerce data.
What type of information does web analytics typically provide?
Descriptive information such as unique visitors, time spent on a site, and conversion rates.
What does A/B testing in web analytics entail?
Statistical comparisons of which version of a website gets more clicks or conversions.
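A minimal sketch of one common way to make such a comparison, a two-proportion z-test on conversion counts. The chapter does not name a specific test, so this choice and the visitor numbers below are assumptions for illustration.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of versions A and B; return (z, two-sided p)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the assumption that both versions perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical results: 1,000 visitors per version.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
```

A small p-value suggests the difference between versions is unlikely to be chance alone, though the cutoff for "small" is a decision the team has to make.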
What is social media analytics focused on?
Counting social activities and assessing the sentiment associated with them.
What problem can arise from excessive technological proliferation in analytics?
It can lead to the use of too many analytics and data management tools without a coherent architecture.
According to a 2015 survey, how many analytics and data management tools did marketing organizations average?
More than twelve tools.
What was a key barrier to success for Equifax identified in a 2010 assessment?
Analytics activities took too long to complete due to organizational and data-related issues.
What significant change has Equifax made to its analytics infrastructure?
Shifted to a Hadoop-based data lake for easier and more cost-effective data assembly.
How has the speed of analytics changed at Equifax under new leadership?
Analytics evaluation time fell from about a month to just a few days.
What analytical model does Equifax use to identify trends in consumer credit history?
A neural network model.
Which tools does Equifax incorporate into its analytics technology architecture?
Open-source tools like R and Python.
What is the role of business intelligence software?
Allows users to create reports, visualize data, and share insights.
What do visual analytical tools enable users to do without statistical skills?
Manipulate data and analyses through an intuitive visual interface.
What is critical for effective deployment of analytics?
Creating, managing, implementing, and maintaining data and applications.
What is a deployment platform in analytics?
A structured approach to managing the deployment process of analytics.
What are major concerns related to deployment processes in analytics?
Privacy, security, and the ability to archive and audit data.
What indicates a company has its analytics act together?
Centralized analytical roles and some degree of central coordination.
What must senior management establish for a robust analytical architecture?
Guiding principles that align architectural decisions with business strategy.
Fill in the blank: An enterprise-wide approach to managing data and analytics is often viewed as a _______.
renegade activity.
What should the analytics architecture be able to do in a fast-changing environment?
Be flexible and adapt to changing business needs and objectives.