TOPIC 3 - BIG DATA, DATA ANALYTICS, AND MACHINE LEARNING Flashcards
How can data help businesses
Smarter and faster decisions
Accurate predictions
Sorting the signal from the noise
Efficient operations including real-time changes
Data is often viewed as what by companies
An asset
What is the cross industry standard process for data mining
An iterative process that often involves going back and forth between stages
In practice what should happen in cross industry standard process for data mining
Shortcuts from each stage back to the prior one
What does CRISP-DM stand for
Cross-Industry Standard Process for Data Mining
What's the starting point and goal of CRISP-DM
Solve a business problem
What should the business problem be
Important and solvable
How will the solution to the business problem be built
By using data as the raw material
What do you need to understand
The strengths and weaknesses of the data
CRISP-DM needs to weigh up what
Benefits and costs of acquiring data
What happens in the preparation stage
Clean the data
Decide on which variables you require
What happens when you “clean” the data
Convert data to different types
Deal with missing values
Normalize or scale variables
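The three cleaning steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `raw_ages` values are made up, mean imputation is one of several ways to handle missing values, and min-max scaling is one of several scaling choices.

```python
# Hypothetical raw records: ages arrive as strings and one value is missing.
raw_ages = ["34", "", "51", "27"]

# 1. Convert data to a different type (string -> float); keep gaps as None.
ages = [float(a) if a else None for a in raw_ages]

# 2. Deal with missing values: here, fill gaps with the mean of the observed values.
observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)
filled = [a if a is not None else mean_age for a in ages]

# 3. Scale the variable to the range [0, 1] (min-max scaling).
lowest, highest = min(filled), max(filled)
scaled = [(a - lowest) / (highest - lowest) for a in filled]

print(filled)
print(scaled)
```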
How do you decide which variables you require
Can be guided by theory
In machine learning known as “feature engineering”
What happens in the modelling and evaluation stages
May use various tools to help model the data
Need to evaluate our model rigorously
Important that your model is comprehensible
Why do we evaluate our model
Beware of correlations by chance, p-hacking and overfitting
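A small sketch of why overfitting matters. The data and the `memoriser` model are hypothetical: the model simply recalls its training labels, so it looks perfect on the data it has seen and performs poorly on held-out data.

```python
# Hypothetical labelled data: customer id -> outcome.
train = {1: "buy", 2: "skip", 3: "buy", 4: "skip"}
test = {5: "skip", 6: "buy", 7: "skip", 8: "skip"}

def memoriser(x):
    # Overfit "model": recalls training labels exactly, guesses "buy" otherwise.
    return train.get(x, "buy")

def accuracy(model, data):
    # Share of examples where the prediction matches the true label.
    hits = sum(model(x) == y for x, y in data.items())
    return hits / len(data)

train_acc = accuracy(memoriser, train)  # performance on data the model has seen
test_acc = accuracy(memoriser, test)    # performance on held-out data

print(train_acc, test_acc)
```

Evaluating only on the training data would hide the problem, which is why rigorous evaluation uses data the model has not seen.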
What happens in the deployment stage
Important to understand the benefits and risks of deployment
Continuous monitoring is often required
Why is continuous monitoring required
Such monitoring detects worsening or unexpected model performance
Allows timely remediation actions such as adding new variables or retraining your model
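One possible shape for such a monitoring check, as a minimal sketch. The function name `needs_remediation`, the 10% tolerance, and the error figures are all assumptions made for illustration.

```python
def needs_remediation(baseline_error, recent_errors, tolerance=0.10):
    """Return True when the recent average error exceeds the baseline
    by more than `tolerance` (expressed as a fraction of the baseline)."""
    recent_avg = sum(recent_errors) / len(recent_errors)
    return recent_avg > baseline_error * (1 + tolerance)

baseline = 0.20                     # error rate measured at deployment (made up)
stable_week = [0.19, 0.21, 0.20]    # performance in line with the baseline
drifting_week = [0.26, 0.30, 0.28]  # performance has worsened

print(needs_remediation(baseline, stable_week))
print(needs_remediation(baseline, drifting_week))
```

A True result would be the trigger for a remediation action such as adding new variables or retraining the model.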
Where does big data come from
Everywhere
Specific examples of big data
Internet interactions
Text documents
Images and videos
What's the widely used definition of big data
Big data is any set of data that is too large or too complex to be handled using conventional data-processing techniques
What's a synonym for big data
Alternative data
What are the 4 Vs
Volume
Velocity
Variety
Veracity
What is Volume in 4 Vs of big data
Terabytes to exabytes of existing data to process
What is Velocity in 4 Vs of big data
Streaming data, milliseconds to seconds to respond
What is Variety in 4 Vs of big data
Structured, unstructured, text, multimedia
What is Veracity in 4 Vs of big data
Uncertainty due to data inconsistency and incompleteness
How is data when talking about volume
Data at rest
How is data when talking about velocity
Data in motion
How is data when talking about variety
Data in many forms
How is data when talking about veracity
Data in doubt
What are the additional Vs
Variability
Value proposition
What do the Vs of big data mean is not possible
To store and process all the data on a single computer
What can big data help with
Process efficiency
Finding new connections in data
Improving predictions
Needs to be analysed to have value
Data analytics is the process of what
Discovering patterns and relationships in data
What are the 4 types of analytics
Descriptive analytics
Diagnostic analytics
Predictive analytics
Prescriptive analytics
What is descriptive analytics and an example
What has happened -> describes something that happened -> 50% returns in a month
What is diagnostic analytics
Why has something happened -> Describes the reason for the historical results -> Customers often return items because they were not what they expected
Predictive analytics and an example
What will happen if? -> determines what will happen by analyzing past data -> next quarter looks like a decline
What is prescriptive analytics
What to do to make it happen -> Use info from other 3 to suggest a decision
What is artificial intelligence
Computer models or systems that exhibit intelligent behavior like humans
We currently have what
Narrow AI systems
What is Narrow AI
Specialize in specific tasks
What is machine learning
Study of algorithms that:
- improve their performance
- at some task
- with experience
Example of machine learning
T: Playing chess
P: Percent games won against an opponent
E: Playing games against itself
What are the 2 types of machine learning
Supervised
Unsupervised
What is supervised machine learning
Have training data with desired outputs (labels)
Needs a stable environment
Focus on prediction
What is unsupervised machine learning
Only have training data without labels
No feedback
Focus on finding groups of similar items based on the data
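Finding groups of similar items without labels can be sketched with a tiny two-cluster version of k-means on one variable. This is an illustration only: `two_means` is a hypothetical helper, the spend figures are made up, and the sketch assumes the data contains at least two distinct values.

```python
def two_means(values, iterations=10):
    """Split 1-D values into two groups by alternating between
    assigning each value to its nearest centre and updating the centres."""
    low, high = min(values), max(values)  # initial centres: the two extremes
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)     # new centre of the low group
        high = sum(group_high) / len(group_high)  # new centre of the high group
    return group_low, group_high

# Hypothetical monthly spend per customer; no labels are supplied.
spend = [12, 15, 14, 90, 95, 88]
small_spenders, big_spenders = two_means(spend)

print(small_spenders)
print(big_spenders)
```

No desired outputs are given to the algorithm; the two groups emerge from the data alone.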
What are the 5 approaches of data analysis
Traditional econometrics
Supervised learning
Unsupervised learning
Traditional programming
Machine learning programs
What approaches have labeled data
Traditional econometrics
Supervised learning
What has unlabeled data
Unsupervised learning
What method is traditional econometrics
Linear regression
What methods are supervised learning and unsupervised learning
Supervised Machine learning
Unsupervised machine learning
Results in traditional econometrics
Explanatory model and statistical significance
What are the results of supervised learning
Prediction model and prediction performance
What are the results of unsupervised learning
Data structure model and data structure characteristics
What is traditional programming
Write a program with explicit rules to follow
What are machine learning programs
Write a computer program to learn from examples
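The contrast between the two approaches can be sketched as follows. Everything here is hypothetical: the order amounts, the hand-written threshold of 100, and the simple `learn_threshold` routine, which stands in for a real learning algorithm.

```python
# Traditional programming: the rule is written explicitly by the programmer.
def is_large_order_rule(amount):
    return amount > 100  # threshold chosen by hand

# Machine learning program: the rule is learned from labelled examples.
def learn_threshold(examples):
    """Pick the candidate threshold that classifies the most examples correctly."""
    candidates = sorted(amount for amount, _ in examples)

    def correct(t):
        # Number of examples where "amount > t" agrees with the label.
        return sum((amount > t) == label for amount, label in examples)

    return max(candidates, key=correct)

# Hypothetical examples: (order amount, was it a large order?)
examples = [(20, False), (45, False), (60, False), (150, True), (200, True)]
threshold = learn_threshold(examples)

def is_large_order_learned(amount):
    return amount > threshold

print(threshold)
print(is_large_order_learned(150), is_large_order_learned(45))
```

In the first function a person supplies the rule; in the second the rule comes from the examples, so different examples would produce a different threshold.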
Supervised machine learning uses data for what
To learn a hypothesis to predict
Supervised machine learning uses what
Classification models
Regression models
When do you use classification models vs Regression models
Classification -> target variable is categorical
Regression -> target variable is continuous
What's optimization in Supervised machine learning
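A minimal sketch of the distinction, using made-up data. The regression part fits a one-variable least-squares line to a continuous target; the classification part uses a 1-nearest-neighbour rule for a categorical target. Both `fit_line` and `nearest_label` are hypothetical helpers chosen for illustration, not the only models of each kind.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def nearest_label(xs, labels, x_new):
    """1-nearest-neighbour classifier: copy the label of the closest example."""
    closest = min(range(len(xs)), key=lambda j: abs(xs[j] - x_new))
    return labels[closest]

xs = [1, 2, 3, 4]
sales = [3.0, 5.0, 7.0, 9.0]              # continuous target -> regression
segment = ["low", "low", "high", "high"]  # categorical target -> classification

slope, intercept = fit_line(xs, sales)
predicted_sales = slope * 5 + intercept
predicted_segment = nearest_label(xs, segment, 3.6)

print(predicted_sales, predicted_segment)
```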
How is the model trained on the data
What's representation in Supervised machine learning
How is the data specified
What is the form of the model
What's evaluation in Supervised machine learning
How are we assessing if model is successful
What's the performance measure
AI wins when information is what
More transparent and voluminous
Humans win when institutional knowledge is what
Crucial
What happens to the performance edge of AI over time
Declines over time when alternative data is found
Combining AI and humans produces what
The most accurate forecasts
Applications of AI and ML in finance
Asset management
Call centres
Credit and insurance
When will larger training datasets improve prediction accuracy
If, given X, a human can confidently predict Y, then yes
ML techniques are valuable when:
Have lots of features and training examples
Impact of features is highly nonlinear
Prediction is more important than inference
Some ML approaches require a lot of what
Computing power
What's one solution to ML needing high computing power
Cloud computing services