1.11 Intro to Big Data Techniques Flashcards
Big Data datasets have traditionally been defined by three key characteristics:
Volume: Data volumes have grown from megabytes to petabytes.
Velocity: High velocity is possible because more data are now available in real time.
Variety: Datasets were once limited to structured data (e.g., SQL tables), but now include semistructured data (e.g., HTML code) and even unstructured data from non-traditional sources such as social media, emails, and text messages.
More recently, a “fourth V” — veracity — has become an increasingly important characteristic. With Big Data being relied upon as the basis for making inferences and predictions, analysts must be able to trust the reliability and credibility of data sources.
Big Data sources include:
financial markets
businesses (e.g., corporate financials)
governments (e.g., trade and economic data)
individuals (e.g., credit card purchases, internet search logs, and social media posts)
sensors (e.g., satellite imagery and traffic patterns)
Internet of Things (e.g., data from “smart” buildings)
Alternative data are generated by three main sources:
individuals, business processes, and sensors.
Challenges with Big Data include
quality, volume, and appropriateness
Artificial intelligence (AI)
enables computers to perform tasks that traditionally have required human intelligence.
AI has been used by financial institutions since the 1980s, beginning with neural networks used to detect credit card fraud.
Machine learning (ML) algorithms
computer programs that learn how to complete tasks, improving over time as more data become available.
ML models are trained to map relationships between inputs and outputs.
A training dataset is used to create the model, and other datasets are used to validate the model.
Algorithms are not explicitly programmed, which can result in outcomes that are not easily understood.
Human judgment is needed to ensure data quality.
Potential errors include overfitting, which occurs when an ML model learns the inputs and target dataset too well. Overfitting can cause ML models to treat noise in the data as true parameters.
Underfitting occurs when an ML model is too simplistic, which can cause it to treat true parameters as noise.
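To make the distinction concrete, the sketch below (not from the curriculum; the sine-curve data and polynomial degrees are illustrative assumptions) fits polynomials of increasing flexibility to noisy data. The low-degree fit underfits and misses the true relationship, while the high-degree fit overfits and chases noise, which shows up as low training error but high validation error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y_true = np.sin(2 * np.pi * x)                      # the "true" relationship
y = y_true + rng.normal(scale=0.3, size=x.size)     # training data with noise

x_val = np.linspace(0, 1, 100)                      # separate validation grid
y_val = np.sin(2 * np.pi * x_val)

for degree in (1, 3, 15):                           # too simple, reasonable, too flexible
    coefs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    val_mse = np.mean((np.polyval(coefs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: training MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")
```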
Types of Machine Learning
Machine learning includes supervised and unsupervised learning.
In supervised learning
computers learn from labeled training data.
The inputs and outputs are identified for the algorithm.
This approach could be used to identify the best variable to forecast future stock returns.
In unsupervised learning
only the dataset is provided to the algorithm
The inputs and outputs are not labeled.
This could be used to group companies into peer groups.
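As a rough illustration (all data here are randomly generated and the variable names are hypothetical), the sketch below uses scikit-learn to contrast the two approaches: a supervised regression trained on labeled inputs and outputs, and an unsupervised k-means clustering that groups companies into peer groups from unlabeled data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Supervised: both inputs (hypothetical factor exposures) and outputs
# (next-period returns) are labeled for the algorithm.
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.05, size=200)
model = LinearRegression().fit(X, y)
print("estimated coefficients:", model.coef_)

# Unsupervised: only the dataset is provided, with no labels; k-means
# groups companies into peer groups based on hypothetical fundamentals.
fundamentals = rng.normal(size=(50, 4))             # e.g., size, leverage, margin, growth
peer_groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(fundamentals)
print("peer groups of first five companies:", peer_groups[:5])
```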
Deep learning
utilizes neural networks to identify patterns with a multistage approach.
Early stages develop an understanding of simple concepts, which later stages combine to create more complex concepts.
These algorithms are used in applications such as image and speech recognition.
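A minimal sketch of this idea is shown below, assuming scikit-learn's small bundled digits dataset as a stand-in for a real image-recognition task: a network with two hidden layers lets earlier layers capture simple patterns that later layers combine into more complex concepts.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)                 # 8x8 grayscale digit images, flattened
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers: a multistage structure in which earlier layers pick up
# simple patterns and later layers combine them into more complex concepts.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```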
Data Processing Methods
Data capture – Collecting data and transforming them to a usable format. Low-latency systems communicate high volumes of data with minimal delay, which is needed for automated trading.
Data curation – Cleaning data to ensure high quality.
Data storage – Recording, archiving, and accessing data.
Search – Locating specific information in large datasets.
Transfer – Moving data from their source or storage location to the analytical tool.
Data science
uses computer science and statistics to extract information from data.
Data Visualization
how the data will be displayed and summarized in graphical form
Text analytics
uses computer programs to analyze unstructured text or voice-based datasets.
This could be from data such as company filings, social media, and email.
These methods can be used to identify predictors of market movements and economic indicators such as consumer confidence.
Natural language processing (NLP)
an application of text analytics that focuses on interpreting human language
commonly used for tasks such as translation, speech recognition, text mining, and sentiment analysis.
It can also be used to monitor communications among employees to ensure compliance with policies.
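The toy sketch below (the word lists are hypothetical and far simpler than the lexicons or language models used in practice) illustrates the basic idea behind lexicon-based sentiment analysis: count positive and negative words in each document and report the difference.

```python
import re
from collections import Counter

# Hypothetical sentiment word lists; real systems use curated lexicons
# or trained language models.
POSITIVE = {"growth", "beat", "strong", "record", "upgrade"}
NEGATIVE = {"miss", "weak", "decline", "downgrade", "loss"}

def sentiment_score(text: str) -> int:
    """Return (# positive words) - (# negative words) for one document."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    return sum(words[w] for w in POSITIVE) - sum(words[w] for w in NEGATIVE)

posts = [
    "Record quarter with strong growth; analysts upgrade the stock",
    "Earnings miss and weak guidance point to a decline",
]
for post in posts:
    print(sentiment_score(post), "->", post)
```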
Which of the following is a data visualization tool that is most likely used to display relationships between words that appear in textual data files?
A
Mind map
B
Tag cloud
C
Neural network
A
Mind map
Unlike a tag cloud, which displays words in varying sizes according to the frequency with which they appear in a textual data file, a mind map is a data visualization technique that shows how different concepts are related to each other.
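For illustration, the sketch below (the sample text is made up) computes the word-frequency counts that would drive a tag cloud; the rendering step, where higher counts map to larger type, is omitted.

```python
import re
from collections import Counter

text = (
    "big data volume velocity variety veracity data "
    "machine learning data learning models data"
)
frequencies = Counter(re.findall(r"[a-z]+", text.lower()))
for word, count in frequencies.most_common(5):
    print(f"{word}: {count}")   # higher counts -> larger type in the tag cloud
```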
Data curation is most accurately described as the process of:
A
identifying and correcting for data errors.
B
formatting and summarizing data in graphical form.
C
transforming data into a format that can be used for analysis.
A
identifying and correcting for data errors.
The objective of the data curation process is to ensure high-quality, accurate data. Any errors in a dataset are identified and appropriate action is taken. For example, it may be necessary to make adjustments for missing data that were unavailable or had to be removed.
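As a rough illustration of such a cleaning step (the dataset and threshold are hypothetical), the sketch below uses pandas to flag impossible values as missing and then remove rows with missing prices.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame(
    {"ticker": ["AAA", "BBB", "CCC", "DDD"],
     "close": [101.5, np.nan, -4.0, 87.2]}          # one missing value, one bad value
)

prices.loc[prices["close"] <= 0, "close"] = np.nan  # flag impossible prices as missing
print("missing values per column:")
print(prices.isna().sum())

clean = prices.dropna(subset=["close"])             # or impute, depending on the use case
print(clean)
```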
Text analytics is appropriate for application to:
A
economic trend analysis.
B
large, structured datasets.
C
public but not private information.
A
economic trend analysis.
HTML code is most accurately classified as:
A
structured data.
B
unstructured data.
C
semistructured data.
C
semistructured data.
A machine learning model that has been underfit will most likely:
A
treat noise in a training dataset as true parameters.
B
fail to recognize true relationships in a training dataset.
C
identify relationships in a training dataset that are not found in the validation dataset.
B
fail to recognize true relationships in a training dataset.
Which of the following statements about machine learning is most likely correct?
A
Algorithms are not explicitly programmed
B
The data given to algorithms are explicitly labeled as either inputs or outputs
C
Relationships between inputs and outputs are initially identified in an evaluation dataset
A
Algorithms are not explicitly programmed
Retail point-of-sale scanner data are most accurately classified as:
A
sensor data.
B
corporate exhaust.
C
part of the Internet of Things.
B
corporate exhaust.
Retail point-of-sale scanner data are an example of corporate exhaust that is generated by business processes to provide a real-time indication of business performance.