1.11 Intro to Big Data Techniques Flashcards

1
Q

Big Data datasets have traditionally been defined by three key characteristics:

A

Volume: Data volumes have grown from megabytes to petabytes.

Velocity: High velocity is possible because more data are now available in real time.

Variety: Datasets were once limited to structured data (e.g., SQL tables), but now include semi-structured data (e.g., HTML code) and even unstructured data from non-traditional sources such as social media, emails, and text messages.

More recently, a “fourth V” — veracity — has become an increasingly important characteristic. With Big Data being relied upon as the basis for making inferences and predictions, analysts must be able to trust the reliability and credibility of data sources.

2
Q

Big Data sources include:

A

financial markets

businesses (e.g., corporate financials)

governments (e.g., trade and economic data)

individuals (e.g., credit card purchases, internet search logs, and social media posts)

sensors (e.g., satellite imagery and traffic patterns)

Internet of Things (e.g., data from “smart” buildings)

3
Q

The three main sources of alternative data are generated by

A

individuals, business processes, and sensors.

4
Q

Challenges with Big Data include

A

quality, volume, and appropriateness

5
Q

Artificial intelligence (AI)

A

enables computers to perform tasks that traditionally have required human intelligence.

AI has been used by financial institutions since the 1980s, beginning with neural networks used to detect credit card fraud.

6
Q

Machine learning (ML) algorithms

A

Computer programs that learn how to complete tasks, improving over time as more data become available.

ML models are trained to map relationships between inputs and outputs.

A training dataset is used to create the model, and other datasets are used to validate the model.

Algorithms are not explicitly programmed, which can result in outcomes that are not easily understood.

Human judgment is needed to ensure data quality.

Potential errors include overfitting, which occurs when an ML model learns the training inputs and targets too well. Overfitting can cause ML models to treat noise in the data as true parameters.

Underfitting is the opposite error: the model is too simplistic and may treat true parameters as noise.
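The overfitting/underfitting trade-off can be sketched by fitting polynomials of different degrees to noisy data. The quadratic ground truth, noise level, and degrees below are illustrative assumptions, not anything from the curriculum:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical ground truth: a quadratic relationship plus noise.
def true_f(x):
    return 1.0 + 2.0 * x - 3.0 * x ** 2

x_train = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, 0.2, 30)
x_val = rng.uniform(0, 1, 30)
y_val = true_f(x_val) + rng.normal(0, 0.2, 30)

def fit_and_score(degree):
    # Train on the training set, then score on both datasets;
    # the validation set checks whether the model generalizes.
    coefs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coefs, x) - y) ** 2))
    return mse(x_train, y_train), mse(x_val, y_val)

for degree in (1, 2, 10):
    train_mse, val_mse = fit_and_score(degree)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, val MSE {val_mse:.4f}")
```

The degree-1 model underfits (high error on both datasets); the degree-10 model drives training error down by fitting noise, which typically shows up as a worse validation score than the well-specified degree-2 model.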

7
Q

Types of Machine Learning

A

Machine learning includes supervised and unsupervised learning.

8
Q

In supervised learning

A

computers learn from labeled training data.

The inputs and outputs are identified for the algorithm.

This approach could be used to identify the best variable to forecast future stock returns.
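A minimal sketch of that forecasting use case, with entirely synthetic data and hypothetical predictor names: the algorithm is given labeled (input, output) pairs, and we score which candidate variable best explains the labeled returns.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical labeled dataset: three candidate predictors, and the
# observed next-period return serving as the label.
candidates = {
    "pe_ratio": rng.normal(size=n),
    "momentum": rng.normal(size=n),
    "turnover": rng.normal(size=n),
}
# In this toy setup, returns are driven mostly by momentum.
returns = 0.8 * candidates["momentum"] + rng.normal(scale=0.3, size=n)

def mse_of_linear_fit(x, y):
    # Supervised step: fit y = a*x + b on labeled pairs, then score the fit.
    a, b = np.polyfit(x, y, 1)
    return float(np.mean((a * x + b - y) ** 2))

scores = {name: mse_of_linear_fit(x, returns) for name, x in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # momentum gives the lowest error on the labeled data
```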

9
Q

unsupervised learning

A

Only the dataset is provided to the algorithm.

The inputs and outputs are not labeled.

This could be used to group companies into peer groups.
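The peer-grouping use case can be sketched with a minimal k-means clustering, assuming hypothetical company features (log market cap and volatility); the data and feature choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical, unlabeled company features: [log market cap, volatility].
# No outputs are provided -- the algorithm must discover the groups itself.
large_caps = rng.normal([11.0, 0.15], 0.05, size=(10, 2))
small_caps = rng.normal([6.0, 0.45], 0.05, size=(10, 2))
companies = np.vstack([large_caps, small_caps])

def two_means(data, iters=10):
    # Minimal k-means (k=2): seed one centroid at the first point and the
    # other at the point farthest from it, then alternate between assigning
    # points to the nearest centroid and re-centering each centroid.
    c0 = data[0]
    c1 = data[np.argmax(np.linalg.norm(data - c0, axis=1))]
    centroids = np.array([c0, c1])
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(2):
            centroids[j] = data[labels == j].mean(axis=0)
    return labels

labels = two_means(companies)
print(labels)  # the two peer groups emerge without any labels being given
```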

10
Q

Deep learning

A

utilizes neural networks to identify patterns with a multistage approach.

Early stages learn simple concepts, which later stages combine into more complex concepts.

These algorithms are used in applications such as image and speech recognition.
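The multistage idea can be illustrated with a tiny two-stage network computing XOR. The weights here are hand-set for illustration (a real deep learning model would learn them from data), but they show how simple first-stage concepts combine into a more complex one that no single-stage linear model can represent.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x1, x2):
    # Stage 1: two simple concepts about the inputs.
    h1 = relu(x1 + x2 - 0.5)   # "at least one input is on"
    h2 = relu(x1 + x2 - 1.5)   # "both inputs are on"
    # Stage 2: combine them into the complex concept XOR ("on, but not both").
    out = h1 - 3.0 * h2
    return 1 if out > 0.25 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", forward(x1, x2))  # reproduces XOR: 0, 1, 1, 0
```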

11
Q

Data Processing Methods

A

Data capture – Collecting data and transforming them into a usable format. Low-latency systems communicate high volumes of data with minimal delay, which is needed for automated trading.

Data curation – Cleaning data to ensure high quality.

Data storage – Recording, archiving, and accessing data.

Search – Locating specific information in large datasets.

Transfer – Moving data from their source or storage location to the analytical tool.
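The curation step can be sketched in a few lines: hypothetical raw records arrive in inconsistent shape from data capture, and curation drops what cannot be repaired while coercing the rest into an analysis-ready format. Field names and validity rules here are illustrative assumptions.

```python
# Hypothetical raw records as they might arrive from data capture:
# inconsistent types, a missing field, and an obviously invalid value.
raw = [
    {"ticker": "AAA", "price": "101.5"},
    {"ticker": "BBB", "price": None},   # missing value
    {"ticker": "CCC", "price": "98.0"},
    {"ticker": "DDD", "price": "-1"},   # impossible price
]

def curate(records):
    # Curation: ensure high-quality data by removing or adjusting bad records
    # and converting the survivors to a consistent numeric format.
    clean = []
    for rec in records:
        if rec["price"] is None:
            continue  # adjustment for missing data: drop the record
        price = float(rec["price"])
        if price <= 0:
            continue  # validity check: prices must be positive
        clean.append({"ticker": rec["ticker"], "price": price})
    return clean

print(curate(raw))  # only AAA and CCC survive, with numeric prices
```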

12
Q

Data science

A

uses computer science and statistics to extract information from data.

13
Q

Data Visualization

A

How the data will be displayed and summarized in graphical form.

14
Q

Text analytics

A

Uses computer programs to analyze unstructured text-based or voice-based datasets.

This could be from data such as company filings, social media, and email.

These methods can be used to identify predictors of market movements and economic indicators such as consumer confidence.
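A minimal sketch of text analytics on unstructured text: count word frequencies and tally a toy sentiment score, a simplified version of the consumer-confidence-style indicators mentioned above. The documents and the positive/negative word lists are hypothetical.

```python
import re
from collections import Counter

# Hypothetical unstructured text, e.g. snippets from filings or posts.
docs = [
    "Strong earnings growth and strong consumer confidence",
    "Weak demand, rising costs",
    "Growth in new markets offsets weak legacy segment",
]

# Toy sentiment lexicon (an illustrative assumption, not a standard list).
positive = {"strong", "growth", "confidence"}
negative = {"weak", "rising", "costs"}

def tokenize(text):
    # Lowercase and split on non-letters so "Weak," matches "weak".
    return re.findall(r"[a-z]+", text.lower())

counts = Counter(tok for doc in docs for tok in tokenize(doc))
sentiment = sum(counts[w] for w in positive) - sum(counts[w] for w in negative)
print(counts.most_common(3), sentiment)
```

A positive net score suggests the corpus leans optimistic; real systems would use far larger lexicons or trained models rather than a hand-picked word list.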

15
Q

Natural language processing (NLP)

A

An application of text analytics that focuses on interpreting human language.

Commonly used for tasks such as translation, speech recognition, text mining, and sentiment analysis.

It can also be used to monitor communications among employees to ensure compliance with policies.

16
Q

Which of the following is a data visualization tool that is most likely used to display relationships between words that appear in textual data files?

A
Mind map

B
Tag cloud

C
Neural network

A

A
Mind map

Unlike a tag cloud, which displays words in varying sizes according to how frequently they appear in a textual data file, a mind map is a data visualization technique that shows how different concepts are related to each other.

17
Q

Data curation is most accurately described as the process of:

A
identifying and correcting for data errors.

B
formatting and summarizing data in graphical form.

C
transforming data into a format that can be used for analysis.

A

C
transforming data into a format that can be used for analysis.

The objective of the data curation process is to ensure high-quality, accurate data. Any errors in a dataset are identified and appropriate action is taken. For example, it may be necessary to adjust for data that are missing or had to be removed.

18
Q

Text analytics is appropriate for application to:

A
economic trend analysis.

B
large, structured datasets.

C
public but not private information.

A

A
economic trend analysis.

19
Q

HTML code is most accurately classified as:

A
structured data.

B
unstructured data.

C
semistructured data.

A

C
semistructured data.

20
Q

A machine learning model that has been underfit will most likely:

A
treat noise in a training dataset as true parameters.

B
fail to recognize true relationships in a training dataset.

C
identify relationships in a training dataset that are not found in the validation dataset.

A

B
fail to recognize true relationships in a training dataset.

21
Q

Which of the following statements about machine learning is most likely correct?

A
Algorithms are not explicitly programmed

B
The data given to algorithms are explicitly labeled as either inputs or outputs

C
Relationships between inputs and outputs are initially identified in an evaluation dataset

A

A
Algorithms are not explicitly programmed

22
Q

Retail point-of-sale scanner data are most accurately classified as:

A
sensor data.

B
corporate exhaust.

C
part of the Internet of Things.

A

B
corporate exhaust.

Retail point-of-sale scanner data are an example of corporate exhaust, generated by business processes to provide a real-time indication of business performance.