Topic 1: Introduction to Data Science & Alternative Data Flashcards

1
Q

What gave rise to the realm of data science?

A
  1. Investments in business infrastructure
  2. Volume and variety of data
  3. Powerful computers
2
Q

Why is data mining used for customer relationship management?

A

To manage attrition and maximize expected customer value

3
Q

Type I and Type II Data-Driven Decision-Making (DDD) Problems

A

Type I: Decisions for which discoveries need to be made within the data

Type II: Decisions that repeat, especially at massive scale, so that even small increases in decision-making accuracy based on data analysis are valuable

4
Q

Why view data and data science capability as a strategic asset?

A

Viewing these as assets allows us to think explicitly about the extent to which one should invest in them

5
Q

How do you transition a business problem into a data mining problem?

A

Convert the business problem into subtasks and match the subtasks to known tasks for which tools are available.

6
Q

Difference between regression and classification

A

Classification predicts whether something will happen; regression predicts how much of something will happen.
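
To make the contrast concrete, here is a minimal sketch on made-up customer data. The data, the churn threshold, and the function names are all assumptions for illustration; a real model would be learned from data rather than hand-set.

```python
# Hypothetical toy data: (monthly_usage_hours, churned, monthly_spend)
customers = [
    (2.0, True, 10.0),
    (5.0, True, 22.0),
    (20.0, False, 85.0),
    (30.0, False, 120.0),
]

def classify_will_churn(usage_hours, threshold=10.0):
    """Classification: answer a yes/no question (will this customer churn?).

    The threshold is a hand-picked assumption, not a fitted value.
    """
    return usage_hours < threshold

def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

usage = [c[0] for c in customers]
spend = [c[2] for c in customers]
slope, intercept = fit_line(usage, spend)

def predict_spend(usage_hours):
    """Regression output: a numeric estimate (how much will be spent?)."""
    return slope * usage_hours + intercept

print(classify_will_churn(4.0))       # a class label (True/False)
print(round(predict_spend(10.0), 2))  # a number
```

The classifier returns a label, while the regression returns a quantity; that difference in the target is the whole distinction.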

7
Q

Define Classification

A

Predict, for each individual in a population, which of a set of classes that individual belongs to

8
Q

What does a regression attempt to do?

A

attempts to estimate or predict, for each individual, the numeric value of some variable for that individual

9
Q

Similarity matching

A

attempts to identify similar individuals based on data known about them
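
One common way to do this is to treat each individual as a point and pick the nearest one. The sketch below assumes Euclidean distance and a hypothetical set of customer profiles; other distance measures are equally valid choices.

```python
import math

# Hypothetical customer profiles: (age, yearly_spend_in_thousands)
profiles = {
    "alice": (34, 12.0),
    "bob": (36, 11.5),
    "carol": (61, 3.0),
}

def most_similar(name):
    """Return the other individual whose profile is closest to `name`."""
    target = profiles[name]
    others = (n for n in profiles if n != name)
    # math.dist computes the Euclidean distance between two points.
    return min(others, key=lambda n: math.dist(profiles[n], target))

print(most_similar("alice"))
```

In practice the features would be rescaled first so that no single attribute (here, age) dominates the distance.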

10
Q

Clustering

A

attempts to group individuals in a population together by their similarity, but not driven by any specific purpose.
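
As an illustration, here is a minimal one-dimensional k-means sketch. K-means is only one of many clustering techniques and is not named by the card; the data and starting centers are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical data with two visible groups; no target variable is supplied.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

Note that nothing tells the algorithm what the groups mean; it only finds individuals that sit close together.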

11
Q

Co-occurrence grouping

A

attempts to find associations between entities based on transactions involving them

“what items are often purchased together”
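
A minimal sketch of the idea, assuming a handful of made-up shopping baskets: count how often each pair of items appears in the same transaction.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (market baskets).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

pair_counts = Counter()
for basket in transactions:
    # Sort so that each pair is always stored in the same order.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Real association-mining tools add measures such as support and lift on top of these raw counts.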

12
Q

Profiling (behavior description)

A

attempts to characterize the typical behavior of an individual, group, or population.

13
Q

Link prediction

A

attempts to predict connections between data items

14
Q

Data reduction

A

attempts to take a large dataset and replace it with a smaller set of data

15
Q

Causal modeling

A

helps us understand what events or actions actually influence others

16
Q

Difference between supervised and unsupervised methods

A

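Supervised methods: a specific target has been specified, and the data include values for that target

Unsupervised methods: no specific purpose or target has been specified for grouping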

17
Q

Two main subclasses of supervised data mining

A

Classification (categorical, often binary, target) and regression (numerical target)

18
Q

CRISP-DM process

A
  1. Business understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment
19
Q

Leakage in data preparation

A

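Including a variable from historical data that gives information about the target variable but would not actually be available at the time the decision has to be made (in effect, using the future to predict the future)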
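
One simple guard is to record when each variable becomes known and drop anything that is only known after the decision point. The sketch below is an assumption-laden illustration: the feature names and day offsets are hypothetical.

```python
# Hypothetical feature catalogue: feature name -> day (relative to the decision)
# on which its value becomes known. Day 0 is when the prediction must be made.
feature_known_on_day = {
    "tenure_months": -1,
    "support_calls_last_quarter": -1,
    "account_closure_reason": 30,  # recorded only after the customer has left
}

DECISION_DAY = 0

def split_leaky_features(known_on_day, decision_day=DECISION_DAY):
    """Separate features usable at decision time from those that would leak."""
    safe = sorted(f for f, day in known_on_day.items() if day <= decision_day)
    leaky = sorted(f for f, day in known_on_day.items() if day > decision_day)
    return safe, leaky

safe_features, leaky_features = split_leaky_features(feature_known_on_day)
print(safe_features)
print(leaky_features)
```

A model trained with the leaky column would look excellent on historical data and fail in deployment, because that column does not exist yet when the decision is made.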

20
Q

Purpose of the evaluation stage (in CRISP-DM)

A
  1. Assess the data mining results rigorously
  2. Help ensure that the model satisfies the original business goals

21
Q

Difference between Statistics and the process of Data Mining

A

Data mining is hypothesis generation (it may produce numerical estimates), while statistics focuses mainly on hypothesis testing (can we have confidence in these estimates?)

22
Q

Define a query

A

A specific request for a subset/statistics of data, formulated in a technical language and posed to a database system.
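
For example, SQL is one such technical language. The sketch below uses Python's built-in sqlite3 module with a hypothetical customers table to pose one query for a subset and one for a statistic.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, state TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("alice", "NY", 120.0), ("bob", "NY", 80.0), ("carol", "CA", 50.0)],
)

# Query for a subset: which customers are in New York?
ny_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE state = ? ORDER BY name", ("NY",)
    )
]

# Query for a statistic: what is their average spend?
avg_ny_spend = conn.execute(
    "SELECT AVG(spend) FROM customers WHERE state = ?", ("NY",)
).fetchone()[0]

conn.close()
print(ny_names, avg_ny_spend)
```

Unlike data mining, a query does not discover patterns; the analyst already knows exactly what to ask for.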

23
Q

Difference between Knowledge Discovery and Data Mining (“KDD”) and Machine Learning

A

KDD is more focused on real-world applications. KDD also tends to be more concerned with the entire process of data analytics.

24
Q

Discuss investment managers in terms of their places on the diffusion of innovations curve.

A

Innovators - mostly hedge funds
Early adopters - aggressive long-only and PE mgrs.
Early majority - tech savvy large complex IM firms
Late majority - traditional large complex IM firms
Laggards - reluctant firms

25
Q

What is the unique challenge of alternative data, as described in the paper Alternative Data for Investment Decisions?

A

Standard historical data may not exist

26
Q

Name the risk exposures of early adoption of alternative data, as described in the paper Alternative Data for Investment Decisions.

A
  1. Model risk
  2. Regulatory risk
  3. Data risk
  4. Talent risk
27
Q

What are the four types of data risks as described in the paper Alternative Data for Investment Decisions?

A

Data provenance risk - risk related to origin of data
Accuracy or validity risk - bad trading signals
Material nonpublic information (MNPI) risk
Privacy risk - possibility of personally identifiable information (PII) in the data set

28
Q

Name the risk exposures of late adoption of alternative data, as described in the paper Alternative Data for Investment Decisions.

A
  1. Positioning risk
  2. Execution risk
  3. Consequence risk
29
Q

Categories among the Big Data Analytic groups that are most aligned to support alpha generation

A
  1. Content analytics - extract value from text
  2. Advanced and predictive analytics software tools
  3. Spatial information analytics (SIA) tools - geographic information software and tools
30
Q

Describe the term collective intelligence investing

A

The process of gathering insights from online communities and crowdsourcing.

31
Q

What are the four key platform types offering Collective Intelligence Investing (CII)?

A
  1. Open communities
  2. Digital expert contribution networks
  3. Digital expert communication networks
  4. Crowdsourcing platforms
32
Q

What are the risk exposures of Collective Intelligence Investing and their mitigants?

A
  1. Community engagement risk - adopt gamification
  2. Material nonpublic information risk - Rigorous DD
  3. Model risk - Sufficient testing/sturdiness checks
  4. Information security risk - better security
  5. Data integrity risk
33
Q

Steps to a potentially smooth takeoff for Collective Intelligence Investing

A
  1. Vendor review
  2. Thorough risk assessment
  3. Customized technology architecture
34
Q

Key considerations when you want Wall Street to take notice of your data (in case you want to sell it)

A
  • Data productization (know how the client uses the data)
  • Infrastructure and delivery (how will you deliver data)
  • Distribution (you need an ‘in’ to get them to look at your data)
35
Q

What are some of the factors that determine the commercial value of a dataset?

A
  1. Data Edge
  2. Monetization Strategy
  3. Deep Market
  4. Uniqueness and Replicability
  5. Exclusive Access
  6. Table Stakes Potential
36
Q

What is the single strongest indicator of how much a client will pay for a dataset? And what is the expected return on a data investment?

A

The client's size in terms of assets under management (AuM). The expected return on a data investment is 10-20x.