Topic 1: Introduction to Data Science & Alternative Data Flashcards
What gave rise to the realm of data science?
- Investments in business infrastructure
- Volume and variety of data
- Powerful computers
Why is data mining used for customer relationship management?
To manage attrition and maximize expected customer value
Type I and Type II Data-Driven Decision-Making (DDD) Problems
Type I: Decisions for which discoveries need to be made within the data
Type II: Increase decision making accuracy based on data analysis
Why view data and data science capability as a strategic asset?
Viewing these as assets allows us to think explicitly about the extent to which one should invest in them
How do you transition a business problem into a data mining problem?
Convert the business problem into subtasks and match the subtasks to known tasks for which tools are available.
Difference between regression and classification
Classification predicts whether something will happen; regression predicts how much of something will happen.
Define Classification
Attempts to predict, for each individual in a population, which of a set of classes that individual belongs to
What does a regression attempt to do?
Attempts to estimate or predict, for each individual, the numeric value of some variable for that individual
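Example (illustrative sketch only): the contrast between the two tasks can be shown with a tiny nearest-neighbour predictor. The customer data, the column meanings, and the function names `nearest`, `classify`, and `regress` are all hypothetical. The same neighbours are used for both tasks; only the kind of target changes.

```python
from collections import Counter

# Hypothetical history: (monthly_usage_hours, churn_label, next_year_spend)
history = [
    (2.0, "churn", 50.0),
    (3.0, "churn", 80.0),
    (10.0, "stay", 400.0),
    (12.0, "stay", 450.0),
    (11.0, "stay", 420.0),
]

def nearest(usage, k=3):
    """Return the k historical rows whose usage is closest to `usage`."""
    return sorted(history, key=lambda row: abs(row[0] - usage))[:k]

def classify(usage, k=3):
    """Classification: predict a class label by majority vote of neighbours."""
    labels = [label for _, label, _ in nearest(usage, k)]
    return Counter(labels).most_common(1)[0][0]

def regress(usage, k=3):
    """Regression: predict a numeric value by averaging the neighbours."""
    values = [spend for _, _, spend in nearest(usage, k)]
    return sum(values) / len(values)

print(classify(11.5))  # a class: will this customer churn or stay?
print(regress(11.5))   # a number: how much is this customer expected to spend?
```

Classification answers "which class?", regression answers "how much?".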
Similarity matching
attempts to identify similar individuals based on data known about them
Clustering
attempts to group individuals in a population together by their similarity, but not driven by any specific purpose.
Co-occurrence grouping
Attempts to find associations between entities based on transactions involving them
Example: “What items are often purchased together?”
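Example (illustrative sketch only): one simple way to look for co-occurrence is to count how often each pair of items appears in the same transaction. The baskets and the helper `pair_counts` are hypothetical; practical association mining would also consider measures such as support and lift.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is the contents of one basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair one canonical order, so (a, b) == (b, a)
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```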
Profiling (behavior description)
attempts to characterize the typical behavior of an individual, group, or population.
Link prediction
attempts to predict connections between data items
Data reduction
attempts to take a large dataset and replace it with a smaller set of data
Causal modeling
helps us understand what events or actions actually influence others
Difference between supervised and unsupervised methods
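Supervised methods: a specific target is defined, and data with known values of that target is available
Unsupervised methods: no specific purpose or target has been specified for grouping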
Two main subclasses of supervised data mining
Classification (categorical, often binary, target) and regression (numerical target)
CRISP-DM process
- Business understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
Leakage in data preparation
Including a variable from historical data that gives information on the target variable but would not be available at the time the prediction has to be made (the model is effectively told the future)
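Example (illustrative sketch only): one possible guard against leakage is to record when each variable becomes known and keep only those available by the decision date. The feature names, dates, and the helper `usable_features` are hypothetical.

```python
from datetime import date

# Hypothetical feature catalogue: name -> date on which the value is known.
feature_known_on = {
    "tenure_months": date(2023, 1, 1),
    "support_calls_last_quarter": date(2023, 1, 1),
    # Recorded only after the customer has left, so it leaks the target.
    "account_closure_reason": date(2023, 6, 30),
}

def usable_features(known_on, decision_date):
    """Keep only features whose values exist by the decision date."""
    return sorted(name for name, d in known_on.items() if d <= decision_date)

decision_date = date(2023, 1, 15)
features = usable_features(feature_known_on, decision_date)
print(features)
```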
Purpose of the evaluation stage (in CRISP-DM)
- Assess the data mining results rigorously
- Help ensure that the model satisfies the original business goals
Difference between Statistics and the process of Data Mining
Data mining is hypothesis generation (it may produce numerical estimates), while statistics focuses mainly on hypothesis testing (can we have confidence in these estimates?)
Define a query
A specific request for a subset/statistics of data, formulated in a technical language and posed to a database system.
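Example (illustrative sketch only): the two kinds of request, a subset and a statistic, can be shown with SQL posed to an in-memory SQLite database. The `customers` table and its columns are hypothetical.

```python
import sqlite3

# Build a small hypothetical table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "US", 80.0), (3, "EU", 200.0), (4, "US", 40.0)],
)

# Query for a subset: customers who spent more than 100.
big_spenders = conn.execute(
    "SELECT id FROM customers WHERE spend > 100 ORDER BY id"
).fetchall()

# Query for a statistic: average spend per region.
avg_by_region = dict(
    conn.execute(
        "SELECT region, AVG(spend) FROM customers GROUP BY region"
    ).fetchall()
)
conn.close()

print(big_spenders)
print(avg_by_region)
```

A query returns exactly what was asked for; it does not discover patterns on its own, which is what separates querying from data mining.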
Difference between Knowledge Discovery and Data Mining (“KDD”) and Machine Learning
KDD is more focused on real-world application problems. KDD also tends to be more concerned with the entire process of data analytics.
Discuss investment managers in terms of their places on the diffusion of innovations curve.
Innovators - mostly hedge funds
Early adopters - aggressive long-only and private equity (PE) managers
Early majority - tech-savvy large, complex investment management (IM) firms
Late majority - traditional large, complex investment management (IM) firms
Laggards - reluctant firms
What unique challenge of alternative data is described in the paper Alternative Data for Investment Decisions?
Standard historical data may not exist
Name the risk exposures of early adoption of alternative data as described in the paper Alternative Data for Investment Decisions.
- Model risk
- Regulatory risk
- Data risk
- Talent risk
What are the four types of data risks as described in the paper Alternative Data for Investment Decisions?
Data provenance risk - risk related to origin of data
Accuracy or validity risk - bad trading signals
Material nonpublic information (MNPI) risk
Privacy risk - possibility of personally identifiable information (PII) in the data set
Name the risk exposures of late adoption of alternative data as described in the paper Alternative Data for Investment Decisions.
- Positioning risk
- Execution risk
- Consequence risk
Categories among the Big Data Analytic groups that are most aligned to support alpha generation
- Content analytics - extract value from text
- Advanced and predictive analytics software tools
- Spatial information analytics (SIA) tools - geographic information software and tools
Describe the term collective intelligence investing
The process of gathering insights from online communities and crowdsourcing.
What are the four key platform types offering Collective Intelligence Investing (CII)
- Open communities
- Digital expert contribution networks
- Digital expert communication networks
- Crowdsourcing platforms
What are the risk exposures of Collective Intelligence Investing and their mitigants?
- Community engagement risk - adopt gamification
- Material nonpublic information risk - rigorous due diligence
- Model risk - sufficient testing/robustness checks
- Information security risk - better security
- Data integrity risk
Steps to a potentially smooth takeoff for Collective Intelligence Investing
- Vendor review
- Thorough risk assessment
- Customized technology architecture
Key considerations when you want Wall Street to take notice of your data (in case you want to sell it)
- Data productization (know how the client uses the data)
- Infrastructure and delivery (how will you deliver data)
- Distribution (you need an ‘in’ to get them to look at your data)
What are some of the factors that determine the commercial value of a dataset?
- Data Edge
- Monetization Strategy
- Deep Market
- Uniqueness and Replicability
- Exclusive Access
- Table Stakes Potential
What is the single strongest indicator of how much a client will pay for a dataset? And what is the expected return on a data investment?
The client's size in terms of assets under management (AuM). The expected return on a data investment is 10-20x.