TOPIC 3 - BIG DATA, DATA ANALYTICS, AND MACHINE LEARNING Flashcards
How can data help businesses
Smarter and faster decisions
Accurate predictions
Sorting the signal from the noise
Efficient operations including real-time changes
What do companies often view data as
An asset
What is the Cross-Industry Standard Process for Data Mining
An iterative process that often involves going back and forth between stages
In practice, what should happen in the Cross-Industry Standard Process for Data Mining
Shortcuts from each stage back to the prior one
What does CRISP-DM stand for
Cross-Industry Standard Process for Data Mining
What are the starting point and goal of CRISP-DM
Solve a business problem
What should the business problem be
Important and solvable
How will the solution to the business problem be built
By using data as the raw material
What do you need to understand about the data
The strengths and weaknesses of the data
What does CRISP-DM need to weigh up
Benefits and costs of acquiring data
What happens in the preparation stage
Clean the data
Decide on which variables you require
What happens when you “clean” the data
Convert data to different types
Deal with missing values
Normalize or scale variables
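The three cleaning steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the function name `clean_column`, the choice of mean imputation, and min-max scaling are all assumptions made for the example.

```python
from statistics import mean

def clean_column(raw_values):
    """Convert strings to floats, fill gaps with the mean, then scale to [0, 1]."""
    # Step 1: convert types; empty strings and None are treated as missing.
    converted = [float(v) if v not in ("", None) else None for v in raw_values]

    # Step 2: deal with missing values by imputing the mean of the observed ones
    # (assumes at least one value is present).
    observed = [v for v in converted if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in converted]

    # Step 3: min-max scaling so every value lies between 0 and 1.
    lo, hi = min(filled), max(filled)
    if hi == lo:
        return [0.0 for _ in filled]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in filled]

scaled = clean_column(["10", "", "30", "20"])
print(scaled)  # [0.0, 0.5, 1.0, 0.5]
```

Mean imputation is only one option; dropping incomplete rows or using a median are equally common choices, depending on the data.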
How do you decide which variables you require
Can be guided by theory
In machine learning known as “feature engineering”
What happens in the modelling and evaluation stages
May use various tools to help model the data
Need to evaluate our model rigorously
Important that your model is comprehensible
Why do we evaluate our model rigorously
Beware of correlations by chance, p-hacking and overfitting
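Correlations by chance are easy to demonstrate with a small simulation. The sketch below is an illustration under assumed settings (20 observations, 200 candidate variables, a fixed seed; the function names are hypothetical): every variable is pure noise, yet the strongest one still appears meaningfully correlated with the target.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def strongest_chance_correlation(n_obs=20, n_features=200, seed=0):
    """Largest absolute correlation between a noise target and noise features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is independent random noise, unrelated to the target.
        feature = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(feature, target)))
    return best

best = strongest_chance_correlation()
print(f"Strongest correlation among pure-noise features: {best:.2f}")
```

Trying many variables and reporting only the best one is the essence of p-hacking, which is why evaluation on data the model has not seen matters.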
What happens in the deployment stage
Important to understand the benefits and risks of deployment
Continuous monitoring is often required
Why is continuous monitoring often required
Such monitoring detects worsening or unexpected model performance
Allows timely remediation actions such as adding new variables or retraining your model
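One simple way to picture such monitoring is a rolling average of prediction errors that raises a flag when it crosses a limit. The sketch below is a hypothetical illustration: the class name `ErrorMonitor`, the window size, and the threshold are assumptions for the example, not part of CRISP-DM.

```python
from collections import deque

class ErrorMonitor:
    """Tracks the rolling mean absolute error of a deployed model."""

    def __init__(self, window=5, threshold=2.0):
        # Only the most recent `window` errors are kept.
        self.errors = deque(maxlen=window)
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store one error and return True if performance has degraded."""
        self.errors.append(abs(predicted - actual))
        rolling = sum(self.errors) / len(self.errors)
        return rolling > self.threshold

monitor = ErrorMonitor(window=3, threshold=2.0)
stream = [(10, 10), (12, 11), (15, 11), (20, 12)]  # (predicted, actual) pairs
alerts = [monitor.record(p, a) for p, a in stream]
print(alerts)  # [False, False, False, True]
```

An alert like the final one would be the trigger for remediation, such as retraining the model or adding new variables.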
Where does big data come from
Everywhere
Specific examples of big data
Internet interactions
Text documents
Images and videos
What's the widely used definition of big data
Big data is any set of data that is too large or too complex to be handled using conventional data-processing techniques
What's a synonym for big data
Alternative data
What are the 4 Vs of big data
Volume
Velocity
Variety
Veracity
What is Volume in the 4 Vs of big data
Terabytes to exabytes of existing data to process
What is Velocity in the 4 Vs of big data
Streaming data, milliseconds to seconds to respond
What is Variety in the 4 Vs of big data
Structured, unstructured, text, multimedia
What is Veracity in the 4 Vs of big data
Uncertainty due to data inconsistency and incompleteness
How is data described when talking about volume
Data at rest
How is data described when talking about velocity
Data in motion
How is data described when talking about variety
Data in many forms