1 Data Science & Big Data Flashcards
Define Data Science
Two definitions:
- The study of data: processing it, representing it, and extracting value and knowledge from it
- Data-Driven Science (a tool for scientific discovery in other scientific fields). It complements Hypothesis-Driven Science
Hypothesis-Driven Science Process
- Hypothesis
- Perform Experiment
- Check Outcome
Data-Driven Science Process
- Take existing data
- Perform analysis
- Observe results
Sources of Data
- Physics Equations
- Historical Books/Artifacts
- Earth Data
- Biomedical Data
- User Activity
- User Content
Is data often intended for scientific use?
No. Data is usually collected for some other activity, e.g. clinical practice, accounting, or sharing knowledge.
Possible outcomes of a data science analysis
- Useful insight (can be published)
- Inform on further experiments to be conducted to verify a hypothesis
- Inform whether it is feasible to build a system that accurately predicts the data
What does it mean that data is high-dimensional
Each instance has many features (large D), so the data cannot easily be plotted in two dimensions
How machine learning based dimensionality reduction works
The machine tries to reproduce the pairwise distances between the original points (x) in the high-dimensional space using points (y) in a lower-dimensional space. It adjusts the y points until the differences between corresponding pairwise distances in the two spaces are as close to zero as possible.
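A minimal sketch of this idea in Python with NumPy, assuming Euclidean distances and plain gradient descent on the sum of squared distance differences. The function names, the step size, and the number of steps are illustrative assumptions, not something the flashcard specifies.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance between every pair of rows in `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def stress(x, y):
    """Sum over all pairs of (distance in y-space - distance in x-space)^2."""
    gap = pairwise_distances(y) - pairwise_distances(x)
    return (gap ** 2).sum() / 2  # each pair appears twice in the matrix


def reduce_dimensions(x, target_dim=2, steps=300, learning_rate=0.01, seed=0):
    """Place one low-dimensional point y_i per high-dimensional point x_i."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    dist_x = pairwise_distances(x)
    # Start from small random positions in the low-dimensional space.
    y = rng.normal(scale=0.1, size=(n, target_dim))
    for _ in range(steps):
        diff = y[:, None, :] - y[None, :, :]        # diff[i, j] = y_i - y_j
        dist_y = np.sqrt((diff ** 2).sum(axis=-1))
        # Small constant avoids dividing by zero on the diagonal.
        coeff = (dist_y - dist_x) / (dist_y + 1e-12)
        # Gradient of the stress with respect to each y_i.
        grad = 2 * (coeff[:, :, None] * diff).sum(axis=1)
        y -= learning_rate * grad                    # adjust the y points
    return y
```

Calling `reduce_dimensions(x)` on an array `x` of shape (N, D) returns an array of shape (N, 2) that can be drawn as a scatter plot. The result may be a local optimum, so the distance differences are reduced but not necessarily zero.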
Statistics vs Machine Learning
Statistics makes assumptions about the nature of the data, while machine learning addresses the data as it occurs in real-world applications.
ML for Data Science vs ML for Decision Making
ML for DS: generate presentable insights/hypotheses
ML for DM: focus on learning good autonomous decisions that can be taken for repetitive tasks without human intervention
Data Science / ML lingo for row
Instance (can also say data point)
N means number of data points (instances)
DS/ML lingo for column
Feature (can also say attribute)
D means number of features (attributes)
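As a small illustration of the row/column vocabulary, the sketch below stores a made-up dataset as a NumPy array, with one row per instance and one column per feature. The values and variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy dataset: each row is an instance, each column a feature.
data = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
])

N, D = data.shape            # N instances (rows), D features (columns)
first_instance = data[0]     # one row: all D feature values of one data point
first_feature = data[:, 0]   # one column: one feature across all N instances
```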
Metadata
A description of what the data in the dataset is actually about.
E.g. descriptions of the dataset, its features, and its instances
“Data about the data”
Network Dataset Adjacency Matrix
A network dataset can be stored as a table of size N×N (the adjacency matrix). Entry (i, j) records whether data points i and j are connected.
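A minimal sketch of building such a matrix from a list of connections, assuming an undirected network and 0/1 entries. The helper name `adjacency_matrix` and the example edges are hypothetical.

```python
import numpy as np


def adjacency_matrix(num_points, edges):
    """Return an N x N matrix with 1 where two data points are connected."""
    matrix = np.zeros((num_points, num_points), dtype=int)
    for i, j in edges:
        matrix[i, j] = 1
        matrix[j, i] = 1  # assumption: connections are undirected
    return matrix


# Hypothetical network of 4 data points connected in a chain: 0-1-2-3.
edges = [(0, 1), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
```

Here `A[0, 1]` is 1 because points 0 and 1 are connected, while `A[0, 3]` is 0 because they are not.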
Large Datasets
Datasets too large to be processed with classical techniques (e.g. a single computer, standard software). The work is spread over several machines (computers), which must keep the model synchronized between them.
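One common way to synchronize a model is to let every machine compute an update on its own part of the data and then average the updates. The sketch below simulates this on a single computer, assuming a linear model, mean squared error, and equally sized shards. The "machines" are just list entries, and all names and values are illustrative.

```python
import numpy as np


def gradient(weights, features, targets):
    """Gradient of the mean squared error of a linear model on one shard."""
    errors = features @ weights - targets
    return 2 * features.T @ errors / len(targets)


def synchronized_step(weights, shards, learning_rate=0.1):
    # Each simulated machine computes a gradient on its own shard only.
    local_gradients = [gradient(weights, f, t) for f, t in shards]
    # Synchronization: average the gradients so that every machine
    # applies the same update and ends up with the same model.
    average_gradient = np.mean(local_gradients, axis=0)
    return weights - learning_rate * average_gradient


# Hypothetical data, split evenly over 4 simulated machines.
rng = np.random.default_rng(0)
features = rng.normal(size=(120, 3))
true_weights = np.array([1.0, -2.0, 0.5])
targets = features @ true_weights
shards = list(zip(np.array_split(features, 4), np.array_split(targets, 4)))

weights = np.zeros(3)
for _ in range(200):
    weights = synchronized_step(weights, shards)
```

With equally sized shards, the averaged gradient equals the gradient computed on the full dataset, so the synchronized model follows the same path as training on one machine that could hold all the data.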