1.11 Intro to Big Data Techniques Flashcards
Big Data datasets have traditionally been defined by three key characteristics:
Volume: Data volumes have grown from megabytes to petabytes.
Velocity: The velocity of data has increased because more data are now available in real time.
Variety: Datasets were once limited to structured data (e.g., SQL tables), but now include semi-structured data (e.g., HTML code) and even unstructured data from non-traditional sources such as social media posts, emails, and text messages.
More recently, a “fourth V” — veracity — has become an increasingly important characteristic. With Big Data being relied upon as the basis for making inferences and predictions, analysts must be able to trust the reliability and credibility of data sources.
Big Data sources include:
financial markets
businesses (e.g., corporate financials)
governments (e.g., trade and economic data)
individuals (e.g., credit card purchases, internet search logs, and social media posts)
sensors (e.g., satellite imagery and traffic patterns)
Internet of Things (e.g., data from “smart” buildings)
The three main sources of alternative data are
individuals, business processes, and sensors.
Challenges with Big Data include
quality, volume, and appropriateness
Artificial intelligence (AI)
enables computers to perform tasks that traditionally have required human intelligence.
AI has been used by financial institutions since the 1980s, beginning with neural networks used to detect credit card fraud.
Machine learning (ML) algorithms
computer programs that learn how to complete tasks, improving as more data become available.
ML models are trained to map relationships between inputs and outputs.
A training dataset is used to create the model, and other datasets are used to validate the model.
Algorithms are not explicitly programmed, which can result in outcomes that are not easily understood.
Human judgment is needed to ensure data quality.
Potential errors include overfitting, which occurs when an ML model learns the inputs and target dataset too well; overfitted models treat noise in the data as true parameters.
Underfitting is the opposite error: the model is too simplistic and treats true parameters as noise. Both errors are illustrated in the sketch below.
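A minimal sketch of the train/validate split and of both fitting errors, assuming NumPy is available; the data, seed, and polynomial degrees are invented for illustration.

```python
import numpy as np

# Synthetic data: a known signal (the "true parameters") plus noise.
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

# Random train/validation split, as described in the card above.
idx = rng.permutation(x.size)
train, valid = idx[:20], idx[20:]

for degree in (1, 3, 15):  # underfit, reasonable, overfit
    coeffs = np.polyfit(x[train], y[train], degree)
    mse_train = np.mean((y[train] - np.polyval(coeffs, x[train])) ** 2)
    mse_valid = np.mean((y[valid] - np.polyval(coeffs, x[valid])) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, valid MSE {mse_valid:.3f}")
```

Typically the degree-1 fit underfits (high error on both sets), while the degree-15 fit overfits (near-zero training error but a much larger validation error).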
Types of Machine Learning
Machine learning includes supervised and unsupervised learning.
In supervised learning
computers learn from labeled training data.
The inputs and outputs are identified for the algorithm.
This approach could be used to identify the best variable to forecast future stock returns.
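A minimal supervised-learning sketch, assuming scikit-learn is available; the features, their names, and the return relationship are hypothetical. Note that both the inputs (X) and the labeled outputs (y) are handed to the algorithm.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=1)
# Labeled training data: inputs and outputs are both identified.
X = rng.normal(size=(100, 2))  # hypothetical features, e.g., [value, momentum]
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(0, 0.1, 100)  # next-period return

model = LinearRegression().fit(X, y)
print(model.coef_)           # learned mapping from inputs to outputs
print(model.predict(X[:3]))  # return forecasts for the first three rows
```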
In unsupervised learning
only the dataset is provided to the algorithm.
The inputs and outputs are not labeled.
This could be used to group companies into peer groups.
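A minimal unsupervised sketch, assuming scikit-learn: k-means receives only the dataset (no labels) and discovers peer groups on its own. The fundamentals and the choice of four clusters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=2)
# Each row is a company described by synthetic fundamentals,
# e.g., [size, profitability, leverage]; no output labels exist.
fundamentals = rng.normal(size=(50, 3))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(fundamentals)
print(kmeans.labels_[:10])  # peer-group assignments found by the algorithm
```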
Deep learning
uses neural networks, often with many layers, to identify patterns in a multistage approach.
Early stages learn simple concepts, which later stages combine into more complex concepts.
These algorithms are used in applications such as image and speech recognition.
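A minimal deep-learning sketch using scikit-learn's small multilayer network (an assumption; image and speech systems use dedicated frameworks and far deeper networks). The data and layer sizes are invented; the point is the multistage structure, with later layers building on earlier ones.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(seed=3)
X = rng.normal(size=(200, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # a nonlinear, synthetic target

# Two hidden layers: the first learns simple features, the second
# combines them into more complex concepts.
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
net.fit(X, y)
print(net.score(X, y))  # in-sample classification accuracy
```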
Data Processing Methods
Data capture – Collecting data and transforming them into a usable format. Low-latency systems communicate high volumes of data with minimal delay, which is needed for automated trading.
Data curation – Cleaning data to ensure high quality.
Data storage – Recording, archiving, and accessing data.
Search – Locating specific information in large datasets.
Transfer – Moving data from their source or storage location to the analytical tool.
Data science
uses computer science and statistics to extract information from data.
Data Visualization
how the data will be displayed and summarized in graphical form
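A minimal visualization sketch, assuming matplotlib is available; the return series is synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=4)
daily_returns = rng.normal(0.0005, 0.01, 250)  # one hypothetical trading year

fig, ax = plt.subplots()
ax.hist(daily_returns, bins=25)  # summarize the distribution graphically
ax.set_xlabel("Daily return")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of daily returns")
plt.show()
```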
Text analytics
uses computer programs to analyze unstructured text- or voice-based datasets
These data can come from sources such as company filings, social media posts, and emails.
These methods can be used to identify predictors of market movements and economic indicators such as consumer confidence.
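A minimal text-analytics sketch in pure Python: counting word frequencies across unstructured documents. The documents are invented; real pipelines add stop-word removal, stemming, and far larger corpora.

```python
import re
from collections import Counter

documents = [
    "Consumer confidence improved as spending rose.",
    "Spending and confidence both rose this quarter.",
]

tokens = []
for doc in documents:
    tokens.extend(re.findall(r"[a-z']+", doc.lower()))  # crude tokenization

print(Counter(tokens).most_common(5))  # frequent terms as candidate indicators
```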
Natural language processing (NLP)
an application of text analytics that focuses on interpreting human language
commonly used for tasks such as translation, speech recognition, text mining, and sentiment analysis.
It can also be used to monitor communications among employees to ensure compliance with policies.
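A minimal sentiment-analysis sketch using a hand-built lexicon (the word lists are hypothetical; production NLP relies on much richer models and lexicons).

```python
POSITIVE = {"growth", "beat", "strong", "improved"}   # hypothetical lexicon
NEGATIVE = {"miss", "weak", "decline", "lawsuit"}

def sentiment_score(text: str) -> int:
    """Positive-word count minus negative-word count: a crude sentiment signal."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("Strong quarter: earnings beat and margins improved"))  # 3
print(sentiment_score("Revenue decline and a new lawsuit"))                   # -2
```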