Lecture 1 Flashcards
5V’s of Big Data
5 V’s: Volume, Variety, Velocity, Value, and Veracity
Caveat 1: Big Data and Thick Data
“There are many reasons for Nokia’s downfall, but one of the
biggest reasons that I witnessed in person was that the company
over-relied on numbers. They put a higher value on quantitative
data, they didn’t know how to handle data that wasn’t easily
measurable, and that didn’t show up in existing reports.”
Tricia Wang, 2016
• Message
• Beware of Quant Bias or Quant Addiction
• Big Data in many cases needs to be supported with Thick Data
• Thick Data (emotion, context, meaning, ...)
• Ethnography: people’s way of living, culture
• Improve the use of big data or analytics by seeing the whole
picture of decision making.
Why “big data” NOW?
• Availability of massive amounts of digital data
• Combination of technical developments and societal
needs
• A philosophical view
• Rationalism vs empiricism
• The discovery of the power of data
Why Big Data now: Technical developments
Radical changes in:
• The way elementary data are captured
• Sensors (automated) vs keyboard (human)
• The way data is stored
• Main memory and cloud vs hard disk
• The way data is analyzed
• Data-driven methods vs sampling
• The way data is provided to users
• Data logistics vs data integration
• The way data is presented
• Graphical interactive visualizations vs management reports
• The way knowledge (business rules, models) is created
• Learning/mining vs (labor-intensive) knowledge acquisition
Three types of learning
Supervised:
- Labeled data
- Focused outcomes
- Assess/Predict
Unsupervised:
- No initial focus
- No feedback
- Clustering
Reinforcement:
- Action/Results
- Reward function
- Learning for planning or action
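The contrast between the first two types can be sketched in a few lines of Python. This is a minimal illustration, not a method from the lecture: the toy data, the 1-nearest-neighbour classifier, and the two-cluster k-means loop are all assumptions chosen for brevity. Reinforcement learning is left out because it needs an environment and a reward signal.

```python
from math import dist

# Supervised: labeled examples (features, label) guide the prediction.
# The data points and labels are made up for illustration.
labeled = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
           ((5.0, 5.2), "high"), ((4.8, 5.1), "high")]

def predict_1nn(point, examples):
    """Return the label of the closest labeled example (1-nearest neighbour)."""
    _, label = min(examples, key=lambda ex: dist(point, ex[0]))
    return label

# Unsupervised: no labels and no feedback; group points by proximity alone.
def two_means(points, iterations=10):
    """Split points into two clusters with a basic k-means loop (k = 2)."""
    centers = [min(points), max(points)]  # crude but deterministic start
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:
            nearest = min((0, 1), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

unlabeled = [features for features, _ in labeled]  # same points, labels dropped
print(predict_1nn((4.5, 4.9), labeled))
print(two_means(unlabeled))
```

The supervised function needs the labels to answer; the clustering function recovers the same two groups from the geometry of the points alone, but cannot name them.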
Machine Learning
We don’t solve problems with Machine Learning itself; we solve problems with the rules and knowledge that ML builds.
Steps in ML
- Data and Analytics
- Machine Learning
- Reasoning
- Partnerships
Each step supports the next
Data science definition (Provost)
is a set of fundamental principles that support and guide the principled extraction of information and knowledge from data.
o Sometimes referred to as “Applied AI”
Data mining definition (Provost)
is the actual extraction of knowledge from data, via technologies that incorporate these principles
Data-driven Decision Making (DDD) definition (Provost)
is the practice of basing decisions on the analysis of data, rather than purely on intuition
Data Science principles (Provost)
• Entities that are similar with respect to known features or attributes often are similar with respect to unknown features or attributes
• Deal with missing information as far as it goes: a fact that is absent from the data is unknown, not false
o Cf. the old “closed-world” view in traditional databases: “not in DB, then false”
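The difference between the closed-world view and treating missing information as unknown can be sketched as follows. The fact sets and function names are hypothetical and only serve to illustrate the contrast.

```python
from typing import Optional, Tuple

Fact = Tuple[str, str]

# Hypothetical toy "database": facts recorded as true, and facts recorded as false.
known_true = {("alice", "premium")}
known_false = {("bob", "premium")}

def closed_world(fact: Fact) -> bool:
    """Closed-world view: anything not recorded as true is treated as false."""
    return fact in known_true

def open_world(fact: Fact) -> Optional[bool]:
    """Open-world view: a fact that is not recorded is unknown (None), not false."""
    if fact in known_true:
        return True
    if fact in known_false:
        return False
    return None

query = ("carol", "premium")  # nothing is recorded about carol
print(closed_world(query))  # False
print(open_world(query))    # None
```

Under the closed-world view carol is simply "not premium"; under the open-world view the question stays open and can be handed to a model that estimates the missing value.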
Data Science Principles (1)
Extracting useful knowledge from data to solve
business problems can be treated systematically by
following a process with reasonably well-defined
stages. The Cross-Industry Standard Process for Data
Mining (CRISP-DM) is one codification of this
process.
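For reference, the six CRISP-DM phases can be written down as a small Python enumeration. The class and helper names are hypothetical, and the wrap-around in `next_phase` is a simplification: in practice the process also loops back between earlier phases.

```python
from enum import Enum

class CrispDmPhase(Enum):
    """The six phases of CRISP-DM, in their usual order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6

def next_phase(phase: CrispDmPhase) -> CrispDmPhase:
    """Return the following phase; wrap around after deployment (simplified cycle)."""
    phases = list(CrispDmPhase)
    return phases[(phases.index(phase) + 1) % len(phases)]

for phase in CrispDmPhase:
    print(phase.value, phase.name.replace("_", " ").title())
```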
Data Science Principles (2)
• If you look too hard at a set of data, you will find
something—but it might not generalize beyond the
data you’re looking at (problem of overfitting)
• To draw causal conclusions, one must pay very close
attention to the presence of confounding factors,
possibly unseen ones (observation vs intervention)
• When using AI heuristics to find some optimum you
may end up in a local maximum.
• The relationship between the business problem and
the analytics solution often can be decomposed into
tractable subproblems via the framework of analyzing
expected value.
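The expected value framework in the last bullet can be sketched for a simple targeting decision. The function names and all numbers are illustrative assumptions: the response probability would come from a model, and the values from the business case.

```python
def expected_value(p_response: float, value_response: float,
                   value_no_response: float) -> float:
    """Expected value of targeting one customer.

    p_response:        estimated probability that the customer responds
    value_response:    net value if the customer responds
    value_no_response: net value if not (typically minus the contact cost)
    """
    return p_response * value_response + (1.0 - p_response) * value_no_response

def should_target(p_response: float, value_response: float,
                  value_no_response: float) -> bool:
    """Target the customer only if the expected value is positive."""
    return expected_value(p_response, value_response, value_no_response) > 0

# Hypothetical offer: a response is worth 99, a wasted contact costs 1.
print(should_target(0.05, 99.0, -1.0))   # 5% response chance
print(should_target(0.005, 99.0, -1.0))  # 0.5% response chance
```

The decomposition is visible in the signature: estimating `p_response` is a modeling subproblem, while `value_response` and `value_no_response` come from the business side. With these numbers the break-even response probability is 1%.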
Types of Business Analytics research
- Applying analytics to answer a business question
• Problem-oriented. Management Science. Result is an insight with practical value
• Scientific value depends on genericity
• Example: “analyze effects of social media intensity on program item redemption”
- Developing an analytics-based tool/process/method
• Design Science research (DSR): result is an IS application
• Example: “association rule-based anomaly detection for event logs”
- Improving upon analytical techniques/tools
• Technical Design research (CS): result is a new or improved algorithm
- Other types of research: identify and address legal challenges, economic consequences, …
Final remarks lecture 1
• Data Science is a response to Big Data by adopting
AI.
• AI is making big steps because of Big Data
• A data science solution must be embedded in the
business. This is not a simple step.
• The increasing use of AI in business imposes new
challenges:
• Human-centered AI: how to make the best use of humans and
machines
• Responsible AI