1 Data Science & Big Data Flashcards

1
Q

Define Data Science

A

Two definitions:
- The study of data: processing it, representing it, and extracting value and knowledge from it
- Data-Driven Science: a tool for scientific discovery in other scientific fields. It complements Hypothesis-Driven Science.

2
Q

Hypothesis-Driven Science Process

A
  1. Hypothesis
  2. Perform Experiment
  3. Check Outcome
3
Q

Data-Driven Science Process

A
  1. Take existing data
  2. Perform analysis
  3. Observe results
4
Q

Physics Equations
Historical Books/Artifacts
Earth Data
Biomedical Data
User Activity
User Content

A

Sources of Data

5
Q

Is data often intended for scientific use?

A

No. Data is usually collected for some other activity, e.g. clinical practice, accounting, or sharing knowledge.

6
Q

Possible outcomes of a data science analysis

A
  1. A useful insight (can be published)
  2. Guidance on which further experiments to conduct to verify a hypothesis
  3. An indication of whether it is feasible to build a system that accurately predicts the data
7
Q

What does it mean for data to be high-dimensional?

A

The data has many features, so it cannot easily be plotted in two dimensions.

8
Q

How machine learning based dimensionality reduction works

A

The machine tries to recreate the original distances between points (x) in a high-dimensional space using points (y) in a lower-dimensional space. It adjusts the y points until the differences between the pairwise distances in the two spaces are nearly zero.
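
The idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a specific library's algorithm: it assumes Euclidean distances, a sum-of-squared-differences objective (called `stress` here), and plain gradient descent. The function names, the toy data, the step size `lr` and the number of `steps` are all hypothetical choices made for the example.

```python
import math
import random


def pairwise_distances(points):
    """Return the matrix of Euclidean distances between all pairs of points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def stress(target, points):
    """Sum of squared differences between target and current pairwise distances."""
    current = pairwise_distances(points)
    n = len(points)
    return sum(
        (current[i][j] - target[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def random_layout(n, out_dim=2, seed=0):
    """Random starting positions for the low-dimensional points y."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(out_dim)] for _ in range(n)]


def reduce_dimensions(x, y, steps=500, lr=0.02):
    """Move the y points by gradient descent so their distances match those of x."""
    target = pairwise_distances(x)
    n, out_dim = len(y), len(y[0])
    for _ in range(steps):
        current = pairwise_distances(y)
        updated = []
        for i in range(n):
            grad = [0.0] * out_dim
            for j in range(n):
                if i == j or current[i][j] == 0.0:
                    continue  # no self-distance; skip coincident points
                # Derivative of (d_low - d_high)^2 with respect to y[i]
                scale = 2.0 * (current[i][j] - target[i][j]) / current[i][j]
                for k in range(out_dim):
                    grad[k] += scale * (y[i][k] - y[j][k])
            updated.append([y[i][k] - lr * grad[k] for k in range(out_dim)])
        y = updated
    return y


# Hypothetical 3-D data: four corners of a square plus a point above its centre.
x = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [0.5, 0.5, 1]]
y_start = random_layout(len(x))
y_final = reduce_dimensions(x, y_start)

target = pairwise_distances(x)
print("stress before:", stress(target, y_start))
print("stress after: ", stress(target, y_final))
```

The stress generally cannot reach exactly zero when the data does not fit in the lower-dimensional space, which is why the card says "nearly zero".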

9
Q

Statistics vs Machine Learning

A

Statistics makes assumptions about the nature of the data, while machine learning addresses the data as it occurs in real-world applications.

10
Q

ML for Data Science vs ML for Decision Making

A

ML for DS: generate presentable insights/hypotheses

ML for DM: learn to make good decisions autonomously, so that repetitive tasks can be handled without human intervention

11
Q

Data Science / ML lingo for row

A

Instance (can also say data point)

N means number of data points (instances)

12
Q

DS/ML lingo for column

A

Feature (can also say attribute)

D means number of features (attributes)
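
As a small illustration of the row and column vocabulary, a tabular dataset can be held as a list of rows. The values and variable names below are made up for the example; N is the number of instances (rows) and D the number of features (columns).

```python
# Hypothetical toy dataset: each inner list is one instance (row),
# and each position within it is one feature (column).
dataset = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
]

N = len(dataset)     # number of instances (data points)
D = len(dataset[0])  # number of features (attributes)

first_instance = dataset[0]                  # one row
first_feature = [row[0] for row in dataset]  # one column

print(f"N = {N}, D = {D}")
```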

13
Q

Metadata

A

A description of the data in the dataset and what it is actually about.
Eg: descriptions of the dataset, its features and its instances

“Data about the data”

14
Q

Network Dataset Adjacency Matrix

A

A network dataset can be stored as a table of size N x N (the adjacency matrix). The entry in row i, column j records whether data points i and j are connected.
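
A minimal sketch of such a matrix, assuming an undirected, unweighted network whose nodes are numbered from 0. The helper names and the example edges are hypothetical.

```python
def adjacency_matrix(n, edges):
    """Build an n x n matrix with 1 where two nodes are connected, else 0."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in edges:
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected: the connection holds in both directions
    return matrix


def connected(matrix, i, j):
    """Look up whether nodes i and j share an edge."""
    return matrix[i][j] == 1


# Hypothetical network with N = 4 nodes joined in a chain 0-1-2-3.
edges = [(0, 1), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)

for row in A:
    print(row)
```

For a directed network the matrix would not be symmetric, and for a weighted one the entries would hold weights instead of 0 and 1.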

15
Q

Large Datasets

A

Datasets that are too large to be processed with classical techniques (eg a single computer, standard software). The work is spread across several machines, which must keep the model synchronized between them.
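
The card does not say how the synchronization is done. One common approach, assumed here purely for illustration, is for each machine to compute a gradient on its own shard of the data and for the gradients to be averaged before every machine applies the same update. The sketch below simulates this in a single process with a one-parameter model y = w * x; the shards, function names, learning rate and step count are hypothetical.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one machine's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def train_synchronized(shards, steps=200, lr=0.01):
    """Simulate data-parallel training with one synchronization per step."""
    w = 0.0  # shared model parameter, identical on every machine
    for _ in range(steps):
        # Each machine would compute this on its own shard, in parallel.
        grads = [local_gradient(w, shard) for shard in shards]
        # Synchronization: average the gradients, then update the shared model.
        w -= lr * sum(grads) / len(grads)
    return w


# Hypothetical data generated from y = 3x and split across two machines.
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]
w = train_synchronized(shards)
print("learned w:", w)
```

In a real deployment the averaging step is where network communication happens, so its cost often decides how well the approach scales.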

16
Q

Two-level data analysis

A

The first level of analysis may be performed on non-sensitive public data to train the model. The data representations learned there are then applied at the second level, to private datasets, which are analysed to extract useful insights.
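
A minimal sketch of the two levels. The card does not name a technique, so as an assumption the "representation" here is the mean and first principal direction of the public data, found by power iteration. All function names and data values are hypothetical.

```python
def column_means(rows):
    """Mean of each feature (column)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def learn_representation(public_rows, iterations=100):
    """Level 1: learn the mean and first principal direction from public data."""
    mean = column_means(public_rows)
    centered = [[v - m for v, m in zip(row, mean)] for row in public_rows]
    d = len(mean)
    direction = [1.0] * d  # starting guess; must not be orthogonal to the data
    for _ in range(iterations):
        # Power iteration on the unnormalized covariance: v <- X^T (X v)
        scores = [sum(c * v for c, v in zip(row, direction)) for row in centered]
        direction = [
            sum(s * row[k] for s, row in zip(scores, centered)) for k in range(d)
        ]
        norm = sum(v * v for v in direction) ** 0.5
        direction = [v / norm for v in direction]
    return mean, direction


def apply_representation(private_rows, mean, direction):
    """Level 2: project private data onto the direction learned at level 1."""
    return [
        sum((v - m) * w for v, m, w in zip(row, mean, direction))
        for row in private_rows
    ]


# Hypothetical public data lying along the line y = 2x.
public = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
# Hypothetical private data, never used for learning.
private = [[2.5, 5.0], [3.5, 7.0]]

mean, direction = learn_representation(public)
codes = apply_representation(private, mean, direction)
print("direction:", direction)
print("private codes:", codes)
```

The private rows are only ever passed to `apply_representation`, so nothing is learned from them; that separation is the point of the two-level setup.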