Big Data Flashcards
Big data terms
6 V’s of big data
Volume, Variety, Velocity, Veracity, Value, Vulnerability
Veracity
The data must be authentic, credible, and available.
Variety
The data is no longer (only) structured, so we can no longer assume that everything fits in a traditional database. We must be prepared to add new data sources in all kinds of formats, ranging from plain text to multimedia content.
Volume
The amount of data collected grows every minute, so storage and processing tools must adapt to that volume by using distributed solutions (many machines working together instead of one very, VERY expensive supercomputer or mainframe).
Velocity
How urgently the data must be processed depends on how often it is generated or acquired and on the need to use it in decision making as quickly as possible, even in real time (or nearly so).
Value
The data must have value for the business or for society.
Vulnerability
The data must comply with the law, respect privacy, and be stored and accessed securely.
2 types of classic machine learning
1) supervised
2) unsupervised
Supervised ML
When the training data is “labeled”.
This means that, for each sample, we have the values of the observed variables (the inputs) and of the variable we want to learn to predict or classify (the output, target, or dependent variable).
Within this type we find
A) regression algorithms (those predicting a numerical value) and
B) classification algorithms (when the output is limited to certain categorical values)
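As a minimal sketch of the regression case, assuming a made-up labeled data set (the `hours` and `scores` arrays and the `fitLine` helper are illustrative names, not part of these cards), a straight line can be fitted to labeled samples with ordinary least squares:

```javascript
// Labeled training data: each input (hours studied) comes with a known
// output (exam score). The numbers are made up for illustration.
const hours = [1, 2, 3, 4, 5];
const scores = [52, 54, 56, 58, 60];

// Fit y = slope * x + intercept by ordinary least squares.
function fitLine(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - meanX) * (ys[i] - meanY);
    den += (xs[i] - meanX) ** 2;
  }
  const slope = num / den;
  return { slope, intercept: meanY - slope * meanX };
}

const model = fitLine(hours, scores);

// Regression: predict a numerical value for an input not seen in training.
const predicted = model.slope * 6 + model.intercept;

// A classifier would output a category instead; here an arbitrary
// threshold stands in for one, just to show a categorical output.
const label = predicted >= 60 ? "pass" : "fail";

console.log(predicted, label);
```

A real project would normally use a library and far more data; the point is only that the labels (`scores`) are what the fit learns from.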
Unsupervised ML
When the training data is not labeled (there is no target variable).
The goal here is to find some kind of structure or pattern, for example to group the training samples, so we’ll be able to classify future samples.
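The grouping idea can be sketched with a tiny one-dimensional k-means, assuming two clusters and hand-picked starting centers (the `samples` values and the `kMeans1D` helper are hypothetical, chosen only for illustration):

```javascript
// Unlabeled samples: there is no target variable, only observed values.
const samples = [1, 2, 3, 10, 11, 12];

// A small 1-D k-means: assign samples to the nearest center, then move
// each center to the mean of its group, and repeat.
function kMeans1D(data, initialCenters, iterations = 10) {
  let current = initialCenters.slice();
  for (let step = 0; step < iterations; step++) {
    const groups = current.map(() => []);
    for (const x of data) {
      let best = 0;
      for (let c = 1; c < current.length; c++) {
        if (Math.abs(x - current[c]) < Math.abs(x - current[best])) best = c;
      }
      groups[best].push(x);
    }
    // Keep the old center if a group ends up empty.
    current = groups.map((g, c) =>
      g.length > 0 ? g.reduce((a, b) => a + b, 0) / g.length : current[c]
    );
  }
  return current;
}

const learned = kMeans1D(samples, [1, 12]);

// Classify a future sample by its nearest learned center.
const cluster = Math.abs(9 - learned[0]) < Math.abs(9 - learned[1]) ? 0 : 1;

console.log(learned, cluster);
```

The algorithm is never told which group a sample belongs to; the two groups emerge from the distances alone.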
Machine Learning (modern, sophisticated)
Ensemble methods
Reinforcement learning
Deep learning
Ensemble methods
The joint use of several algorithms, whose results are combined to obtain a better overall result.
The most common example is Random Forests, although XGBoost has become very well known because of its wins in Kaggle competitions.
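One simple way to combine results is a majority vote. The sketch below assumes three deliberately naive spam rules (the `voters` array and `majorityVote` helper are hypothetical stand-ins for separately trained models; this is not how Random Forests or XGBoost are implemented):

```javascript
// Three simple "models" that each vote spam (true) or not spam (false).
const voters = [
  (msg) => msg.includes("free"),
  (msg) => msg.includes("!!!"),
  (msg) => msg.length > 40,
];

// Combine the individual results: the class chosen by most models wins.
function majorityVote(models, input) {
  const yes = models.filter((m) => m(input)).length;
  return yes > models.length / 2;
}

const verdictA = majorityVote(voters, "Get your free prize now!!!");
const verdictB = majorityVote(voters, "See you at lunch");

console.log(verdictA, verdictB);
```

Each rule alone is easy to fool, but the combined vote only fires when at least two of them agree.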
Reinforcement learning
The machine learns by trial and error, thanks to the feedback it gets from its interactions with the surrounding environment.
You may have heard of AlphaGo (world’s best Go player) or AlphaStar (capable of crushing us in StarCraft II).
Deep learning
It’s based on the use of artificial neural networks. An artificial neural network is a computational model, with a layered structure, formed by interconnected nodes that work together.
Using graphics processing units (GPUs) has improved the speed and cost of deep learning.
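The layered structure can be sketched as a forward pass through two layers. The weights below are picked by hand for illustration (a real network learns them from data), and the `layer` helper and the ReLU activation are assumptions of this sketch:

```javascript
// One layer: every output node sums its weighted inputs, adds a bias,
// and applies an activation function (ReLU here: negatives become 0).
function layer(inputs, weights, biases) {
  return weights.map((row, j) => {
    const sum = row.reduce((acc, w, i) => acc + w * inputs[i], biases[j]);
    return Math.max(0, sum);
  });
}

// Input layer (2 values) -> hidden layer (2 nodes) -> output layer (1 node).
const hidden = layer([1, 2], [[1, 1], [1, -1]], [0, 0]);
const output = layer(hidden, [[2, 3]], [1]);

console.log(hidden, output);
```

Each call to `layer` is a batch of multiply-and-add operations, which is the kind of work GPUs carry out in parallel.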
Programming basics
Data types, strings, arrays, loops, conditions, variables, functions
Interpreted vs compiled vs Byte code
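Interpreted: the source code itself is distributed and translated on the user’s machine each time it runs
Compiled: only machine code is distributed (not cross-platform)
Intermediate: the source is compiled part of the way, into what is called byte code, and the rest of the translation happens on the user’s machine
Compiled: C, C++, Objective-C
Interpreted: PHP, JavaScript
Hybrid: Java, C#, VB.NET, Python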
JavaScript
Meant for manipulating web pages, with the interpreter built into the web browser.
This is unlike Objective-C, C++, or Java, whose programs run on the operating system rather than in the browser.
VBA (Visual Basic for Applications) is interpreted by Microsoft Office.
ActionScript is interpreted by Flash.
Weakly typed language
Vs
Strongly typed language
Weakly typed language = a variable’s type does not need to be declared
Vs
Strongly typed language = a variable’s type (e.g. integer, float, string) must be declared
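A short JavaScript sketch of the weakly typed side, using the card's definition (the variable names are illustrative): no type is declared, and the same variable can hold values of different types.

```javascript
// No type is declared; the variable first holds a number...
let value = 42;
const firstType = typeof value;

// ...and can later hold a string without any error.
value = "forty-two";
const secondType = typeof value;

console.log(firstType, secondType);

// In a language where the type must be declared, such as Java, the
// declaration would read: int value = 42;
```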
Escape the quotes
Vs
Comment out
Use a backslash before each double quote that appears inside a double-quoted string.
"He said, \"that's fine,\" and left."
Use \n to insert a line break in a string.
But use forward slashes to write a comment.
// Single line comment
/* Multiple line
comment */
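Put together in JavaScript, escaping and commenting look like this (a small sketch; the variable names are illustrative):

```javascript
// Single-line comment: ignored when the code runs.
/* Multi-line comment:
   also ignored. */

// Backslashes escape the inner double quotes so they do not end the string.
const quote = "He said, \"that's fine,\" and left.";

// \n inserts a line break inside the string.
const twoLines = "first line\nsecond line";

console.log(quote);
console.log(twoLines);
```

The backslashes themselves are not part of the resulting string; they only tell the interpreter how to read the next character.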
Operator to add a value to a variable without creating a new variable
Compound assignment, increment, and decrement operators
+=
score = score + 10
score += 10
+= -= *= /=
If the value to add or subtract is 1, just use ++ or --
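A quick JavaScript run-through of these operators, with the value of `score` after each step noted in the comments:

```javascript
let score = 0;

score = score + 10; // long form: 10
score += 10;        // add and assign: 20
score -= 5;         // subtract and assign: 15
score *= 2;         // multiply and assign: 30
score /= 3;         // divide and assign: 10

score++;            // increment by 1: 11
score--;            // decrement by 1: 10

console.log(score);
```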