Lecture 1: Introduction Flashcards
What percentage of data produced is currently analyzed?
Only 0.5% of data is analyzed.
What unit measures the total volume of data produced globally?
Zettabytes (ZB) (1 ZB ≈ 1 trillion gigabytes).
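The conversion above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, where 1 GB = 10^9 bytes and 1 ZB = 10^21 bytes; the constant names are illustrative.

```python
BYTES_PER_GB = 10 ** 9    # gigabyte, decimal (SI) definition (assumed)
BYTES_PER_ZB = 10 ** 21   # zettabyte, decimal (SI) definition (assumed)

# Number of gigabytes in one zettabyte: 10**12, i.e. one trillion.
gb_per_zb = BYTES_PER_ZB // BYTES_PER_GB
```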
What are the three Vs of big data?
Volume (size), Variety (heterogeneous sources), Velocity (speed of creation/analysis).
What does Velocity in big data refer to?
The speed at which data is created/analyzed (e.g., ‘data-in-motion’ vs. ‘data-at-rest’).
What is the fourth V in big data?
Veracity (data reliability/credibility).
What did the 2018 Twitter study find about falsehoods?
Falsehoods are 70% more likely to be retweeted than accurate news.
What is TAQ data in finance?
Trades and Quotes data: millisecond-level records of all trades/quotes on exchanges (e.g., NYSE, Nasdaq). Daily TAQ can exceed 100 million rows.
Define machine learning (ML).
Extracting knowledge from data: a system improves its performance on tasks through experience.
What are supervised ML tasks?
Regression (predicting values) and classification (predicting classes).
Name 4 applications of ML in finance.
- Fraud prevention
- Algorithmic trading
- Loan underwriting
- Risk management
What is unsupervised ML?
Finds hidden patterns in unlabeled data via clustering or dimensionality reduction (e.g., PCA).
In reinforcement learning (RL), what is an agent?
The entity that performs actions in an environment to maximize rewards.
Name 3 RL applications in finance.
- Algorithmic trading
- Derivatives hedging
- Portfolio allocation
What are common supervised ML algorithms?
- Linear regression
- Decision trees
- Neural networks
- SVMs
- Random forests
How does Volume in big data evolve?
The threshold for ‘big’ data size is revised upward every year.
What are the key characteristics of supervised learning?
- Uses labeled data (input-output pairs)
- Direct feedback (errors are corrected during training)
- Predicts outcomes (regression) or classifies data (classification)
- Examples: Linear regression, decision trees, neural networks
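The two supervised tasks can be sketched on toy labeled data. This is an illustrative example, not part of the lecture: the data, the function names `fit_line` and `nearest_neighbor_label`, and the choice of 1-nearest-neighbour as the classifier are all assumptions made for the sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def nearest_neighbor_label(train_x, train_labels, x):
    """1-nearest-neighbour classification on a single numeric feature."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_labels[best]


# Regression: labeled pairs (x, y) with a numeric target (toy data, y = 1 + 2x).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
predicted = intercept + slope * 4.0  # prediction for an unseen input

# Classification: labeled pairs (x, label) with a categorical target.
amounts = [10.0, 12.0, 15.0, 900.0, 950.0]
labels = ["normal", "normal", "normal", "suspicious", "suspicious"]
label = nearest_neighbor_label(amounts, labels, 870.0)
```

In both cases the labels supply the "right answer" during training, which is the direct feedback that distinguishes supervised learning.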
What defines unsupervised learning?
- Works with unlabeled data
- No direct feedback (no ‘right answer’ provided)
- Discovers hidden patterns or groupings
- Tasks: Clustering (e.g., customer segmentation) and dimensionality reduction (e.g., PCA)
How does reinforcement learning (RL) work?
- Learns via trial-and-error interactions with an environment
- Uses rewards/punishments (delayed feedback)
- Learns optimal policies to maximize long-term rewards
- Components: Agent, actions, environment, state, reward
- Finance use cases: Algorithmic trading, portfolio optimization
What is dimensionality reduction?
The process of reducing the number of features, or variables, in a dataset while preserving information and overall model performance.
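As a rough illustration, the sketch below reduces two correlated features to one by projecting onto the first principal component (PCA). It assumes 2-D toy data so that the leading eigenvector of the covariance matrix has a closed form; the function names and the data are hypothetical.

```python
import math


def first_principal_direction(points):
    """Mean and unit vector along which 2-D points have the largest variance."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))


def project(points, mean, direction):
    """Reduce each 2-D point to one coordinate along the given direction."""
    return [(p[0] - mean[0]) * direction[0] + (p[1] - mean[1]) * direction[1]
            for p in points]


# Toy data: two strongly correlated features.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
mean, direction = first_principal_direction(points)
scores = project(points, mean, direction)  # one feature instead of two
```

Because the two features move together, the single projected coordinate keeps almost all of the variance in the original data.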
What is clustering?
Clustering allows us to discover hidden structures in data. The goal is to find a natural grouping so that items in the same cluster are more similar to each other than to items in different clusters.
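One common way to find such groupings is k-means, sketched below on toy 2-D points. The choice of k-means, the starting centroids, and the data are assumptions made for illustration; the lecture does not name a specific algorithm here.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                              + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        centroids = new_centroids
    return assignments, centroids


# Two visibly separate groups of unlabeled points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
assignments, centroids = kmeans(points, centroids=[points[0], points[3]])
```

No labels are supplied at any point: the grouping comes only from distances between the points themselves.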
What are the basic concepts of RL?
- Agent: the entity that performs actions.
- Action: what an agent can do in each state.
- Environment: the world in which the agent resides.
- State: the current situation of the agent.
- Reward: the immediate return sent by the environment to evaluate the agent's last action. A reward can be positive (reward) or negative (punishment).
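The five components can be seen together in a small tabular Q-learning sketch. Everything here is an assumption for illustration: a five-state corridor as the environment, a reward of 1 for reaching the right end, an epsilon-greedy agent, and the hyperparameter values.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the terminal goal
ACTIONS = (-1, +1)    # action index 0 = move left, 1 = move right


def step(state, action):
    """Environment: apply the move, return (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning with an epsilon-greedy agent."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # cap on episode length
            # Explore at random with probability epsilon, or when values tie.
            explore = rng.random() < epsilon or q[state][0] == q[state][1]
            if explore:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Move the estimate toward reward + discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy policy for the non-terminal states 0..3.
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:-1]]
```

The reward arrives only at the goal, so earlier states learn their value through the discounted term; this is the delayed feedback that separates RL from supervised learning.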
What is Artificial Intelligence (AI)?
The theory and development of computer systems capable of performing tasks that normally require human intelligence.
Examples include visual perception, speech recognition, decision-making, and language translation.
Give the formal definition of Machine Learning (ML).
A computer program is said to learn from experience (E) with respect to a class of tasks (T) and a performance measure (P) if its performance on tasks in T, as measured by P, improves with experience E.