1.11 Intro to Big Data Techniques Flashcards
Big Data datasets have traditionally been defined by three key characteristics:
Volume: Data volumes have grown from megabytes to petabytes.
Velocity: High velocity is possible because more data are now available in real time.
Variety: Datasets were once limited to structured data (e.g., SQL tables), but now include semistructured data (e.g., HTML code) and even unstructured data from non-traditional sources such as social media, emails, and text messages.
More recently, a “fourth V” — veracity — has become an increasingly important characteristic. With Big Data being relied upon as the basis for making inferences and predictions, analysts must be able to trust the reliability and credibility of data sources.
Big Data sources include:
financial markets
businesses (e.g., corporate financials)
governments (e.g., trade and economic data)
individuals (e.g., credit card purchases, internet search logs, and social media posts)
sensors (e.g., satellite imagery and traffic patterns)
Internet of Things (e.g., data from “smart” buildings)
Alternative data are generated by three main sources:
individuals, business processes, and sensors.
Challenges with Big Data include
quality, volume, and appropriateness
Artificial intelligence (AI)
enables computers to perform tasks that traditionally have required human intelligence.
AI has been used by financial institutions since the 1980s, beginning with neural networks used to detect credit card fraud.
Machine learning (ML) algorithms
computer programs that learn how to complete tasks, improving over time as more data become available.
ML models are trained to map relationships between inputs and outputs.
A training dataset is used to create the model, and other datasets are used to validate the model.
Algorithms are not explicitly programmed, which can result in outcomes that are not easily understood.
Human judgment is needed to ensure data quality.
Potential errors include overfitting, which occurs when an ML model learns the inputs and target dataset too well. Overfitting can cause ML models to treat noise in the data as true parameters.
Underfitting occurs when an ML model is too simplistic, which can cause it to treat true parameters as noise.
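To make the distinction concrete, the sketch below (not from the curriculum; the sine-curve data and polynomial degrees are illustrative assumptions) fits polynomials of increasing flexibility to noisy data. The low-degree fit underfits and misses the true relationship, while the high-degree fit overfits and chases noise, which shows up as low training error but high validation error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y_true = np.sin(2 * np.pi * x)                      # the "true" relationship
y = y_true + rng.normal(scale=0.3, size=x.size)     # training data with noise

x_val = np.linspace(0, 1, 100)                      # separate validation grid
y_val = np.sin(2 * np.pi * x_val)

for degree in (1, 3, 15):                           # too simple, reasonable, too flexible
    coefs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    val_mse = np.mean((np.polyval(coefs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: training MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")
```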
Types of Machine Learning
Machine learning includes supervised and unsupervised learning.
In supervised learning
computers learn from labeled training data.
The inputs and outputs are identified for the algorithm.
This approach could be used to identify the best variable to forecast future stock returns.
In unsupervised learning
only the dataset is provided to the algorithm
The inputs and outputs are not labeled.
This could be used to group companies into peer groups.
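As a rough illustration (all data here are randomly generated and the variable names are hypothetical), the sketch below uses scikit-learn to contrast the two approaches: a supervised regression trained on labeled inputs and outputs, and an unsupervised k-means clustering that groups companies into peer groups from unlabeled data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Supervised: both inputs (hypothetical factor exposures) and outputs
# (next-period returns) are labeled for the algorithm.
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.05, size=200)
model = LinearRegression().fit(X, y)
print("estimated coefficients:", model.coef_)

# Unsupervised: only the dataset is provided, with no labels; k-means
# groups companies into peer groups based on hypothetical fundamentals.
fundamentals = rng.normal(size=(50, 4))             # e.g., size, leverage, margin, growth
peer_groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(fundamentals)
print("peer groups of first five companies:", peer_groups[:5])
```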
Deep learning
utilizes neural networks to identify patterns with a multistage approach.
Early stages develop an understanding of simple concepts, which later stages combine to create more complex concepts.
These algorithms are used in applications such as image and speech recognition.
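A minimal sketch of this idea is shown below, assuming scikit-learn's small bundled digits dataset as a stand-in for a real image-recognition task: a network with two hidden layers lets earlier layers capture simple patterns that later layers combine into more complex concepts.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)                 # 8x8 grayscale digit images, flattened
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers: a multistage structure in which earlier layers pick up
# simple patterns and later layers combine them into more complex concepts.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```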
Data Processing Methods
Data capture – Collecting data and transforming them to a usable format. Low-latency systems communicate high volumes of data with minimal delay, which is needed for automated trading.
Data curation – Cleaning data to ensure high quality.
Data storage – Recording, archiving, and accessing data.
Search – Locating specific information in large datasets.
Transfer – Moving data from their source or storage location to the analytical tool.
Data science
uses computer science and statistics to extract information from data.
Data Visualization
how the data will be displayed and summarized in graphical form
Text analytics
uses computer programs to analyze unstructured text or voice-based datasets.
This could be from data such as company filings, social media, and email.
These methods can be used to identify predictors of market movements and economic indicators such as consumer confidence.
Natural language processing (NLP)
an application of text analytics that focuses on interpreting human language
commonly used for tasks such as translation, speech recognition, text mining, and sentiment analysis.
It can also be used to monitor communications among employees to ensure compliance with policies.
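The toy sketch below (the word lists are hypothetical and far simpler than the lexicons or language models used in practice) illustrates the basic idea behind lexicon-based sentiment analysis: count positive and negative words in each document and report the difference.

```python
import re
from collections import Counter

# Hypothetical sentiment word lists; real systems use curated lexicons
# or trained language models.
POSITIVE = {"growth", "beat", "strong", "record", "upgrade"}
NEGATIVE = {"miss", "weak", "decline", "downgrade", "loss"}

def sentiment_score(text: str) -> int:
    """Return (# positive words) - (# negative words) for one document."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    return sum(words[w] for w in POSITIVE) - sum(words[w] for w in NEGATIVE)

posts = [
    "Record quarter with strong growth; analysts upgrade the stock",
    "Earnings miss and weak guidance point to a decline",
]
for post in posts:
    print(sentiment_score(post), "->", post)
```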
Which of the following is a data visualization tool that is most likely used to display relationships between words that appear in textual data files?
A
Mind map
B
Tag cloud
C
Neural network
A
Mind map
Unlike a tag cloud, which displays words in varying sizes according to the frequency with which they appear in a textual data file, a mind map is a data visualization technique that shows how different concepts are related to each other.
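For illustration, the sketch below (the sample text is made up) computes the word-frequency counts that would drive a tag cloud; the rendering step, where higher counts map to larger type, is omitted.

```python
import re
from collections import Counter

text = (
    "big data volume velocity variety veracity data "
    "machine learning data learning models data"
)
frequencies = Counter(re.findall(r"[a-z]+", text.lower()))
for word, count in frequencies.most_common(5):
    print(f"{word}: {count}")   # higher counts -> larger type in the tag cloud
```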
Data curation is most accurately described as the process of:
A
identifying and correcting for data errors.
B
formatting and summarizing data in graphical form.
C
transforming data into a format that can be used for analysis.
A
identifying and correcting for data errors.
The objective of the data curation process is to ensure high-quality, accurate data. Any errors in a dataset are identified and appropriate action is taken. For example, it may be necessary to make adjustments for missing data that were unavailable or had to be removed.
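As a rough illustration of such a cleaning step (the dataset and threshold are hypothetical), the sketch below uses pandas to flag impossible values as missing and then remove rows with missing prices.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame(
    {"ticker": ["AAA", "BBB", "CCC", "DDD"],
     "close": [101.5, np.nan, -4.0, 87.2]}          # one missing value, one bad value
)

prices.loc[prices["close"] <= 0, "close"] = np.nan  # flag impossible prices as missing
print("missing values per column:")
print(prices.isna().sum())

clean = prices.dropna(subset=["close"])             # or impute, depending on the use case
print(clean)
```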
Text analytics is appropriate for application to:
A
economic trend analysis.
B
large, structured datasets.
C
public but not private information.
A
economic trend analysis.
HTML code is most accurately classified as:
A
structured data.
B
unstructured data.
C
semistructured data.
C
semistructured data.
A machine learning model that has been underfit will most likely:
A
treat noise in a training dataset as true parameters.
B
fail to recognize true relationships in a training dataset.
C
identify relationships in a training dataset that are not found in the validation dataset.
B
fail to recognize true relationships in a training dataset.
Which of the following statements about machine learning is most likely correct?
A
Algorithms are not explicitly programmed
B
The data given to algorithms are explicitly labeled as either inputs or outputs
C
Relationships between inputs and outputs are initially identified in an evaluation dataset
A
Algorithms are not explicitly programmed
Retail point-of-sale scanner data are most accurately classified as:
A
sensor data.
B
corporate exhaust.
C
part of the Internet of Things.
B
corporate exhaust.
Retail point-of-sale scanner data are an example of corporate exhaust that is generated by business processes to provide a real-time indication of business performance.