Topic 1: Introduction to Data Science & Alternative Data Flashcards by Haiko Aragon

What gave rise to the realm of data science?

Investments in business infrastructure
Volume and variety of data
Powerful computers

Why is data mining used for customer relationship management?

To manage attrition and maximize expected customer value

Type I and Type II data-driven decision-making (DDD) problems

Type I: Decisions for which discoveries need to be made within the data

Type II: Decisions that repeat, often at massive scale, so that even small increases in decision-making accuracy based on data analysis pay off

Why view data and data science capability as a strategic asset?

Viewing these as assets allows us to think explicitly about the extent to which one should invest in them

How do you transition a business problem into a data mining problem?

Convert the business problem into subtasks and match the subtasks to known tasks for which tools are available.

Difference between regression and classification

Classification predicts whether something will happen; regression predicts how much something will happen.
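
A minimal sketch of the contrast, assuming a hypothetical churn dataset and a simple nearest-neighbour rule (neither comes from the deck). The same neighbours feed both predictions: the classifier returns a class label, the regressor returns a number.

```python
from collections import Counter

# Hypothetical history: (monthly usage hours, outcome class, monthly spend)
history = [
    (2.0, "churn", 10.0),
    (3.0, "churn", 12.0),
    (4.0, "churn", 15.0),
    (20.0, "stay", 40.0),
    (25.0, "stay", 48.0),
    (30.0, "stay", 55.0),
]


def nearest(usage, k=3):
    """Return the k historical customers whose usage is closest."""
    return sorted(history, key=lambda row: abs(row[0] - usage))[:k]


def classify(usage, k=3):
    """Classification: predict WHETHER the customer churns (a class)."""
    votes = Counter(label for _, label, _ in nearest(usage, k))
    return votes.most_common(1)[0][0]


def regress(usage, k=3):
    """Regression: predict HOW MUCH the customer spends (a number)."""
    neighbours = nearest(usage, k)
    return sum(spend for _, _, spend in neighbours) / len(neighbours)


print(classify(3.5))   # a class label
print(regress(24.0))   # a numeric estimate
```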

Define Classification

Attempts to predict, for each individual in a population, which of a set of classes that individual belongs to

What does a regression attempt to do?

attempts to estimate or predict, for each individual, the numeric value of some variable for that individual

Similarity matching

attempts to identify similar individuals based on data known about them
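
One way to picture this, assuming hypothetical customer profiles and Euclidean distance as the similarity measure (other measures are equally valid):

```python
import math

# Hypothetical profiles: (visits per month, purchases per month)
customers = {
    "A": (1.0, 2.0),
    "B": (1.5, 2.5),
    "C": (8.0, 9.0),
}


def most_similar(name):
    """Return the other customer whose profile is closest to `name`."""
    target = customers[name]
    others = [n for n in customers if n != name]
    return min(others, key=lambda n: math.dist(target, customers[n]))


print(most_similar("A"))
```

In practice the features would usually be scaled first so that no single attribute dominates the distance.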

Clustering

attempts to group individuals in a population together by their similarity, but not driven by any specific purpose.
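
A small sketch of the idea using k-means, which is only one of many clustering methods. The points and starting centroids are hypothetical; note that no target variable appears anywhere.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid  # keep an empty cluster's centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D data with two visually obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, [points[0], points[3]])
print(clusters)
```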

Co-occurrence grouping

attempts to find associations between entities based on transactions involving them

“what items are often purchased together”
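
As an illustration, pair counts over a handful of hypothetical shopping baskets show which items are bought together most often. Real market-basket analysis would also look at measures such as support and lift; this sketch only counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count, top_count / len(baskets))
```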

Profiling (behavior description)

attempts to characterize the typical behavior of an individual, group, or population.

Link prediction

attempts to predict connections between data items
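
A rough sketch, assuming a hypothetical friendship graph and the common-neighbours heuristic (one simple scoring rule among many): unlinked pairs that share many neighbours are the most likely future links.

```python
from itertools import combinations

# Hypothetical undirected graph: person -> set of current friends
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
}


def common_neighbour_scores(graph):
    """Score every pair that is not yet linked by its shared neighbours."""
    scores = {}
    for a, b in combinations(sorted(graph), 2):
        if b not in graph[a]:
            scores[(a, b)] = len(graph[a] & graph[b])
    return scores


scores = common_neighbour_scores(friends)
print(scores)
```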

Data reduction

attempts to take a large dataset and replace it with a smaller set of data

Causal modeling

helps us understand what events or actions actually influence others

Difference between supervised and unsupervised methods

Supervised methods: a specific target has been specified
Unsupervised methods: no specific purpose or target has been specified for the grouping

Two main subclasses of supervised data mining

Classification (categorical target, often binary) and regression (numerical target)

CRISP-DM process

Business understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

Leakage in data preparation

Including variables from the historical data that give information about the target variable but would not be available when the decision is made (in effect, the model peeks at the future)
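
One simple guard can be sketched as follows, assuming we know the date on which each feature value becomes available. The feature names and dates are hypothetical. Anything recorded after the decision date is excluded from training.

```python
from datetime import date

# Hypothetical point in time at which the prediction must be made
decision_date = date(2024, 1, 31)

# Hypothetical catalogue: feature name -> date its value becomes known
feature_known_on = {
    "tenure_months": date(2024, 1, 1),
    "support_calls_last_90d": date(2024, 1, 30),
    # Recorded only after the outcome, so it leaks the target
    "account_closed_flag": date(2024, 3, 15),
}

safe = sorted(n for n, d in feature_known_on.items() if d <= decision_date)
leaky = sorted(n for n, d in feature_known_on.items() if d > decision_date)

print("usable:", safe)
print("leaky:", leaky)
```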

Purpose of the evaluation stage (in CRISP-DM)

1. Assess the data mining results rigorously
2. Help ensure that the model satisfies the original business goals

Difference between Statistics and the process of Data Mining

Data mining is hypothesis generation (it may produce numerical estimates), while statistics focuses mainly on hypothesis testing (can we have confidence in these estimates?)

Define a query

A specific request for a subset of data, or for statistics about data, formulated in a technical language and posed to a database system.
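
For example, an SQL query posed to an in-memory SQLite database. The table and its rows are hypothetical; the query asks for a subset (spend above 20) and for statistics on it (count and average per region).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, region TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Ann", "North", 50.0),
        ("Bob", "North", 30.0),
        ("Cat", "South", 10.0),
        ("Dan", "South", 70.0),
    ],
)

# Subset (WHERE) plus statistics (COUNT, AVG) in one request
rows = conn.execute(
    "SELECT region, COUNT(*), AVG(spend) "
    "FROM customers WHERE spend > ? "
    "GROUP BY region ORDER BY region",
    (20,),
).fetchall()
conn.close()

print(rows)
```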

Difference between Knowledge Discovery and Data Mining (“KDD”) and Machine Learning

KDD is more focused on real-world problems. KDD also tends to be more concerned with the entire process of data analytics.

Discuss investment managers in terms of their places on the diffusion of innovations curve.

Innovators - mostly hedge funds
Early adopters - aggressive long-only and private equity (PE) managers
Early majority - tech-savvy large, complex investment management (IM) firms
Late majority - traditional large, complex IM firms
Laggards - reluctant firms

Name the unique challenge to alternative data as described in the paper Alternative Data for Investment Decisions?

Standard historical data may not exist

Name the risk exposures to early adoption of alternative data as described in the paper Alternative Data for Investment Decisions.

1. Model risk
2. Regulatory risk
3. Data risk
4. Talent risk

What are the four types of data risks as described in the paper Alternative Data for Investment Decisions?

Data provenance risk - risk related to the origin of the data
Accuracy or validity risk - bad trading signals
Material nonpublic information (MNPI) risk
Privacy risk - possibility of personally identifiable information (PII) in the data set

Name the risk exposures to late adoption of alternative data as described in the paper Alternative Data for Investment Decisions.

1. Positioning risk
2. Execution risk
3. Consequence risk

Categories among the Big Data Analytic groups that are most aligned to support alpha generation

1. Content analytics - extract value from text
2. Advanced and predictive analytics software tools
3. Spatial information analytics (SIA) tools - geographic information software and tools

Describe the term collective intelligence investing

The process of gathering insights from online communities and crowdsourcing.

What are the four key platform types offering Collective Intelligence Investing (CII)?

1. Open communities
2. Digital expert contribution networks
3. Digital expert communication networks
4. Crowdsourcing platforms

What are the risk exposures of Collective Intelligence Investing and their mitigants?

1. Community engagement risk - adopt gamification
2. Material nonpublic information risk - rigorous due diligence
3. Model risk - sufficient testing and sturdiness checks
4. Information security risk - better security
5. Data integrity risk

Steps to a potentially smooth takeoff for Collective Intelligence Investing

1. Vendor review
2. Thorough risk assessment
3. Customized technology architecture

Key considerations when you want Wall Street to take notice of your data (in case you want to sell it)

- Data productization (know how the client uses the data)
- Infrastructure and delivery (how will you deliver the data)
- Distribution (you need an 'in' to get them to look at your data)

What are some of the factors that determine the commercial value of a dataset?

1. Data edge
2. Monetization strategy
3. Deep market
4. Uniqueness and replicability
5. Exclusive access
6. Table stakes potential

What is the single strongest indicator of how much a client will pay for a dataset? And what is the expected return on a data investment?

The size of the client in terms of assets under management (AuM). The expected return on a data investment is 10-20x.

(36 cards)