Big Data Flashcards
Big data terms
6 V’s of big data
Volume, Variety, Velocity, Veracity, Value, Vulnerability
Veracity
the data must be
authentic,
credible, and
available
Variety
The data is no longer (only) structured, so we have to forget the idea that everything fits in a traditional database. We must be prepared to add new data sources with all kinds of formats, ranging from plain text to multimedia content
Volume
The amount of data collected grows absurdly every minute, and we need to adapt our storage and processing tools to that volume, using distributed solutions (multiple machines, instead of one very, VERY expensive supercomputer / mainframe)
Velocity
The urgency with which the data must be processed is linked to the frequency of its generation / acquisition, and to the need to use it in decision making as quickly as possible, even in real time (or almost).
Value
the data must have value for the business or for society
Vulnerability
the data must comply with legality, respect privacy, and be stored and accessed in a safe way
2 types of classic machine learning
1) supervised
2) unsupervised
Supervised ML
when the training data is “labeled”.
This means that, for each sample, we have the values of the observed variables (the inputs) and of the variable we want to learn to predict or classify (the output, target, or dependent variable).
Within this type we find
A) the regression algorithms (those predicting a numerical value) and
B) the classification algorithms (when the output is limited to certain categorical values)
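A minimal sketch of the regression side: fitting a straight line to labeled samples by ordinary least squares. The data and function name are hypothetical toy choices, not a standard library API.

```python
# Supervised learning sketch: each sample pairs an input x with a
# labeled output y, and we learn to predict y for new x values.

def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Toy labeled data, roughly y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 9.0, 11.1]
a, b = fit_line(xs, ys)
print(round(a, 1), round(b, 1))  # prints "2.0 1.0"
```

A classification algorithm would work the same way, except the learned output would be a category (e.g. "spam" / "not spam") instead of a number.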
Unsupervised ML
when the training data is not labeled (we don’t have a target variable).
The goal here is to find some kind of structure or pattern, for example to group the training samples, so we’ll be able to classify future samples.
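A minimal sketch of that idea: k-means clustering with k=2 on unlabeled 1-D points. The data and starting centroids are hypothetical toy choices.

```python
# Unsupervised learning sketch: no labels, just points; k-means finds
# group structure by alternating assignment and update steps.

def kmeans_1d(points, centroids, steps=10):
    for _ in range(steps):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[], []]
        for p in points:
            i = 0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids, clusters = kmeans_1d(points, [0.0, 10.0])
print(centroids)  # roughly [1.0, 8.07]
```

Once the centroids are learned, a future unlabeled sample can be classified by whichever centroid it lands closest to.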
Machine Learning (modern, sophisticated)
Ensemble methods
Reinforcement learning
Deep learning
Ensemble methods
Basically, it's the joint use of several algorithms, combining their outputs to obtain better results than any of them alone.
The most common example is Random Forests, although XGBoost has become very famous because of its victories in Kaggle
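A minimal sketch of the core idea behind ensembles: combine several weak classifiers by majority vote. The three rules below are hypothetical toys, not actual Random Forests or XGBoost.

```python
# Ensemble sketch: each weak classifier votes, and the ensemble's
# answer is whichever class gets the most votes.

def majority_vote(classifiers, x):
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)

# Three weak rules for classifying a number as "big" or "small".
classifiers = [
    lambda x: "big" if x > 4 else "small",
    lambda x: "big" if x > 6 else "small",
    lambda x: "big" if x > 5 else "small",
]

print(majority_vote(classifiers, 5.5))  # two of three vote "big"
```

Random Forests apply this same voting idea to many decision trees, each trained on a random slice of the data.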
Reinforcement learning
the machine learns by trial and error, thanks to the feedback it gets from its interactions with the surrounding environment.
You may have heard of AlphaGo (which beat the world's best Go players) or AlphaStar (capable of crushing us at StarCraft II)
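A minimal sketch of the trial-and-error loop: tabular Q-learning on a hypothetical toy corridor of 4 states, where only reaching the last state gives a reward (nothing like AlphaGo's scale, but the same feedback principle).

```python
# Reinforcement learning sketch: the agent acts, observes a reward,
# and updates its value estimates (the Q-table) from that feedback.
import random

random.seed(0)
N_STATES, ACTIONS = 4, [0, 1]          # 0 = step left, 1 = step right
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.5, 0.9, 0.1      # learning rate, discount, exploration

for _ in range(200):                   # episodes of trial and error
    s = 0
    while s != N_STATES - 1:
        # Mostly exploit the best known action, sometimes explore.
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda act: Q[s][act])
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Feedback from the environment updates the value estimate.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [max(ACTIONS, key=lambda act: Q[s][act]) for s in range(N_STATES - 1)]
print(policy)  # the agent learns to always go right: [1, 1, 1]
```

No labels were ever provided; the agent discovered the "always go right" policy purely from the rewards its actions produced.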
Deep learning
It’s based on the use of artificial neural networks. An artificial neural network is a computational model, with a layered structure, formed by interconnected nodes that work together.
Using Graphics Processing Units (GPUs) has made deep learning both faster and cheaper
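A minimal sketch of that layered structure: a forward pass through one hidden layer with hand-picked weights (hypothetical values; a real network learns its weights from data, and GPUs accelerate exactly these multiply-and-sum operations).

```python
# Deep learning sketch: interconnected nodes arranged in layers, each
# node combining all its inputs and applying an activation function.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Each node weights every input, sums, then applies the activation."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [1.0, 0.0]                                  # input layer
hidden = layer(x, [[2.0, 2.0], [-2.0, -2.0]], [-1.0, 1.0])
output = layer(hidden, [[2.0, -2.0]], [0.0])    # output layer
print(output[0])  # a value between 0 and 1
```

"Deep" networks simply stack many such layers, so the output of one layer becomes the input of the next.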
Programming basics
Data types, strings, arrays, loops, conditions, variables, functions
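A minimal sketch touching each of those basics in one hypothetical toy task: counting even numbers in a list.

```python
# Programming basics sketch: function, variable, loop, condition,
# array (a list in Python), and string, all in a few lines.

def count_evens(numbers):                 # a function
    count = 0                             # a variable (int data type)
    for n in numbers:                     # a loop over the array
        if n % 2 == 0:                    # a condition
            count += 1
    return count

values = [1, 2, 3, 4, 5, 6]               # an array (Python list)
message = f"{count_evens(values)} even numbers"   # a string
print(message)  # prints "3 even numbers"
```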