Intro to Data Science Flashcards

1
Q

What is Data Science?

A

Application of computational and statistical techniques on data to gain insight

2
Q

What is the difference between data and information?

A

Data is raw and unusable until organized, whereas information is the result of processing data and putting it into context

3
Q

Explain what computer programming is

A

Creating a sequence of instructions that automates a system to perform a specific task

4
Q

What are the main features of using an IDE?

A

Availability of tools to test and debug

5
Q

What does Jupyter Notebook stand for?

A

The name is a reference to Julia, Python, and R

6
Q

What's the difference between structured and unstructured data?

A

Structured data is organized according to a predetermined set of rules.

Unstructured data is data for which it is difficult to define a predetermined set of rules to organize it.

7
Q

XML vs JSON: which takes less storage, and what do they stand for?

A

JSON, as it doesn't use end tags

JSON: JavaScript Object Notation
XML: Extensible Markup Language
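
The storage claim above can be checked with a small Python sketch. The record and the tag name "person" are hypothetical, chosen only for illustration; the exact sizes depend on the data, so treat this as a demonstration rather than a general proof.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record used only for illustration.
record = {"name": "Ada", "age": 36, "city": "London"}

# JSON stores keys and values without closing tags.
json_text = json.dumps(record)

# XML needs an opening and a closing tag for every field.
root = ET.Element("person")
for key, value in record.items():
    child = ET.SubElement(root, key)
    child.text = str(value)
xml_text = ET.tostring(root, encoding="unicode")

print(len(json_text), json_text)
print(len(xml_text), xml_text)
```

For this record the JSON string is shorter, mainly because each XML field repeats its name in the end tag.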

8
Q

What are the properties of lists, tuples, sets, and dictionaries?

A

List: Ordered, changeable, allows duplicates

Tuple: Ordered, unchangeable, allows duplicates

Set: Unordered, unindexed, no duplicates (existing items cannot be changed, but you can add or remove items)

Dictionary: Ordered, changeable, no duplicate keys (ordered as of Python 3.7; in Python 3.6 and earlier, dictionaries are considered unordered)
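
A short Python sketch of these properties follows. The variable names and values are made up for illustration.

```python
# List: ordered, changeable, duplicates allowed.
fruits_list = ["apple", "banana", "apple"]
fruits_list[0] = "cherry"

# Tuple: ordered, unchangeable, duplicates allowed.
fruits_tuple = ("apple", "banana", "apple")
try:
    fruits_tuple[0] = "cherry"
    tuple_changed = True
except TypeError:
    # Item assignment is not supported on tuples.
    tuple_changed = False

# Set: duplicates collapse; items can be added or removed.
fruits_set = {"apple", "banana", "apple"}
fruits_set.add("cherry")
fruits_set.discard("banana")

# Dictionary: changeable, keys are unique, insertion order is kept (Python 3.7+).
prices = {"apple": 1.0, "banana": 0.5}
prices["apple"] = 1.2   # overwrites the existing key instead of duplicating it
prices["cherry"] = 3.0  # new key is appended at the end

print(fruits_list, tuple_changed, fruits_set, prices)
```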

9
Q

Types of data structures that can contain different data types

A

Lists and data frames

10
Q

How to access components of a list in R?

A

Using the $ sign or [[ ]]

11
Q

How to find the length of a string in R?

A

nchar()

12
Q

What is a database?

A

Collection of data stored in a computer system

13
Q

What does a DBMS allow us to do?

A

Store, query, update, manage, and control access to data

14
Q

Advantages of using a DBMS

A

Store massive amounts of data

Access for multiple users

Concurrency

Efficient Manipulation

15
Q

What are iterations?

A

Instructions that tell the computer to run the same commands repeatedly

16
Q

What are the 3 main types of iterations?

A

for, while, repeat

17
Q

Can you rewrite a for loop with a while loop?

A

Yes, it also works vice versa
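
As a minimal illustration in Python (the card itself does not name a language, so this is an assumption), the same sum can be computed with a for loop and with an equivalent while loop:

```python
values = [3, 5, 8]

# for loop: iterate directly over the elements.
total_for = 0
for v in values:
    total_for += v

# while loop: manage the index and the stopping condition by hand.
total_while = 0
i = 0
while i < len(values):
    total_while += values[i]
    i += 1

print(total_for, total_while)
```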

18
Q

Can you rewrite a while loop with repeat?

A

Yes, but the converse is not true

19
Q

What is debugging?

A

Task of fixing problems in our code

20
Q

State 3 condition handling tools in R

A

withCallingHandlers()

tryCatch()

try()

21
Q

State 3 debugging tools

A

traceback()
options()
browser()

22
Q

What is defensive programming?

A

The strategy of making code fail in a well-defined manner
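
One way to picture this is a function that checks its inputs and raises a clear error as soon as something is wrong. The sketch below is in Python, and the function name `mean` and its error messages are hypothetical.

```python
def mean(values):
    """Return the arithmetic mean, failing early on invalid input."""
    # Be strict about what is accepted, and say exactly what went wrong.
    if not isinstance(values, (list, tuple)):
        raise TypeError("values must be a list or tuple of numbers")
    if len(values) == 0:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


print(mean([1, 2, 3]))
```

Without the checks, an empty list would surface as a less informative ZeroDivisionError from inside the calculation.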

23
Q

State the Fail-fast principle

A

Avoid Functions with non-standard evaluation

Avoid functions that return different types of output depending on their input

24
Q

What objects are mutable?

A

Lists and dictionaries

25
Q

Does aliasing work the same way in both Python and R?

A

No. R uses a copy-on-modify strategy, while in Python a change made through an alias to a mutable object modifies the original object
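
The Python side of this can be sketched as follows. The variable names are illustrative; `copy.copy` is shown as one way to get an independent (shallow) copy.

```python
import copy

original = [1, 2, 3]

# An alias is a second name for the same list object.
alias = original
alias.append(4)  # also visible through `original`

# A shallow copy is a separate list, so changes do not propagate back.
independent = copy.copy(original)
independent.append(5)

print(original, alias, independent)
```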

26
Q

What is the difference between nums.sort() and sorted(nums)?

A

nums.sort() sorts the original list in place and returns None

sorted(nums) returns a new sorted list and leaves the original list unchanged; the result is lost unless it is assigned to a variable
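
A quick Python check of the two behaviours, using a made-up list:

```python
nums = [3, 1, 2]

# sorted() builds a new list; nums keeps its original order.
new_list = sorted(nums)
snapshot = list(nums)  # copy of nums taken before the in-place sort

# list.sort() reorders nums itself and returns None.
result = nums.sort()

print(new_list, snapshot, nums, result)
```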

27
Q

In OOP, what is a class?

A

A class defines a new type of object, allowing us to bundle data and functionality together

Attributes are attached to maintain the state, and methods to modify that state

28
Q

In OOP, what is encapsulation?

A

Bundling data and methods inside an object and restricting direct access to the object's data
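
A minimal Python sketch of a class with encapsulated state follows. `BankAccount` is a hypothetical example; note that Python restricts access by convention (a leading underscore) and read-only properties rather than by strict enforcement.

```python
class BankAccount:
    """Hypothetical class bundling data (balance) with the methods that change it."""

    def __init__(self, owner, balance=0.0):
        self.owner = owner
        self._balance = balance  # leading underscore: internal by convention

    @property
    def balance(self):
        # Read-only view of the internal state (no setter is defined).
        return self._balance

    def deposit(self, amount):
        # The only supported way to change the balance.
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self._balance += amount


account = BankAccount("Ada")
account.deposit(50.0)
print(account.balance)
```

Assigning to `account.balance` directly raises an AttributeError, so callers have to go through `deposit`.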

29
Q

In OOP, what is inheritance?

A

A child class is based on a parent class and has access to the attributes and methods of the parent class

30
Q

In OOP, what is polymorphism?

A

The ability of a method in a child class to behave differently from the same method in the parent class
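
Inheritance and polymorphism can be sketched together in Python. The classes `Animal`, `Dog`, and `Cat` are hypothetical examples.

```python
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return "..."

    def describe(self):
        # Calls whichever speak() the concrete class provides.
        return f"{self.name} says {self.speak()}"


class Dog(Animal):
    # Inherits __init__ and describe; overrides speak.
    def speak(self):
        return "Woof"


class Cat(Animal):
    def speak(self):
        return "Meow"


descriptions = [animal.describe() for animal in (Dog("Rex"), Cat("Tom"))]
print(descriptions)
```

Both child classes reuse `describe` from the parent, while the same `speak` call produces a different result for each.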

31
Q

What is data wrangling?

A

The process of designing and implementing the transformation of raw, unstructured data into a form suitable for analysis

32
Q

What is the use of the apply() function?

A

Apply a function repeatedly over data without writing explicit loops

33
Q

What functions do we use to convert wide data to long format, and vice versa?

A

wide to long: melt() or pivot_longer() in R, and *.melt() in Python
long to wide: acast() and dcast() in R, and pivot_table() in Python
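
The Python half of this card can be sketched with pandas (assumed to be installed). The data frame and its column names are invented for illustration.

```python
import pandas as pd

# Wide format: one row per id, one column per subject.
wide = pd.DataFrame({"id": [1, 2], "math": [90, 80], "art": [70, 60]})

# Wide to long: one row per (id, subject) pair.
long = wide.melt(id_vars="id", var_name="subject", value_name="score")

# Long to wide: subjects become columns again.
# pivot_table aggregates with the mean by default, so scores come back as floats.
back = long.pivot_table(index="id", columns="subject", values="score").reset_index()

print(long)
print(back)
```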

34
Q

What is the difference between .loc[ ] and .iloc[ ]?

A

.loc[ ] is a label-based way to get specific values

.iloc[ ] is an integer position-based way to get specific values
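
A small pandas sketch (pandas assumed installed; the data is made up) shows the two accessors side by side:

```python
import pandas as pd

df = pd.DataFrame({"score": [90, 80, 70]}, index=["ann", "bob", "cy"])

# .loc selects by row and column labels.
by_label = df.loc["bob", "score"]

# .iloc selects by integer positions (row 1, column 0).
by_position = df.iloc[1, 0]

# Label slices include the end label; position slices exclude the end position.
label_slice = df.loc["ann":"bob"]
position_slice = df.iloc[0:1]

print(by_label, by_position, len(label_slice), len(position_slice))
```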

35
Q

How to handle missing values in Python?

A

.dropna()

.fillna()

.interpolate()
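
These three methods belong to pandas (assumed installed). A minimal sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])  # None is stored as NaN

dropped = s.dropna()            # remove the missing entry
filled = s.fillna(0.0)          # replace it with a fixed value
interpolated = s.interpolate()  # linear interpolation between neighbours (the default)

print(dropped.tolist(), filled.tolist(), interpolated.tolist())
```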

36
Q

When using a negative integer for subsetting, what is the difference between R and Python?

A

In R, a negative integer removes the value at that index

In Python, a negative integer selects the value counted from the right end
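
The Python behaviour can be seen in a short sketch; the list is illustrative, and the R comparison appears only in the comments.

```python
letters = ["a", "b", "c", "d"]

# In Python, -1 selects the last element (in R, x[-1] would drop the first one).
last = letters[-1]

# To drop an element in Python, slice around it instead.
all_but_last = letters[:-1]

print(last, all_but_last)
```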

37
Q

What is the workflow of a DS project?

A
  1. Import and wrangle data
  2. EDA and visualize
  3. Consider several models and separate signal from noise
  4. Compare models and inform future decisions
38
Q

How do you visualize distributions?

A

Box plots, violin plots, histograms, density plots (kernel and ridgeline)

39
Q

What graph do we use to explore the association between two continuous variables?

A

Scatter plot

40
Q

What is the use of a histogram and a kernel density plot?

A

They provide information about the distribution of continuous variables

41
Q

Difference between a violin plot and a kernel density plot?

A

Violin plot: empirical density of a continuous variable across the categories of other variables

Kernel density plot: empirical density of a single variable

42
Q

Is ggplot2 the grammar of graphics? If not, why?

A

No. ggplot2 is a package that evolved from the ideas of the grammar of graphics

43
Q

Difference between facet_grid and facet_wrap

A

facet_grid draws a panel for every combination of the faceting variables, even when a panel has no data

facet_wrap only displays panels that contain actual values

44
Q

What library do we use to visualize networks in python?

A

networkx package

45
Q

What is a machine learning model?

A

An algorithm that takes data as input and outputs predictions based on its parameters

46
Q

What is the aim of machine learning?

A

The primary aim is to achieve a low test error, ideally (but not necessarily) with a low training error as well.

47
Q

What is the typical sequence in machine learning?

A
Data Collection
Data Wrangling
Model Building
Model Evaluation
Model saving & testing
48
Q

How to impute missing values?

A

Impute mean, median, mode
Impute value from the observation
Sample from the histogram of observed values
Remove cases with missing values

49
Q

Difference between supervised and unsupervised machine learning?

A

Supervised ML: We have a single variable as target/response variable to predict

Unsupervised ML: No single target/response variable

50
Q

ML models for Regression

A

Linear, lasso, ridge regression

51
Q

ML models for classification

A

Logistic regression
Penalised logistic regression
Support vector machine (svm)

52
Q

ML models that work for any type of response variable

A
Random forest
Gradient boosting
Decision trees
Gaussian process
Neural networks
53
Q

How do we evaluate regression tasks?

A

Mean squared error (MSE)
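
MSE is the average of the squared differences between the true and predicted values. A minimal Python sketch, with made-up numbers and a hypothetical function name:

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Squared errors are 1, 0 and 4, so the mean is 5/3.
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(mse)
```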

54
Q

Why do we use the cross-validation method?

A

To guard against a fortunate or unfortunate split of the data that happens by chance
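
The idea is that every observation is used for testing exactly once, so no single split decides the result. Below is a minimal Python sketch of k-fold index splitting; the function name is hypothetical, and it does not shuffle the data, which a real workflow usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k consecutive folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(train, test)
```

A model would be fitted on each training part and scored on the matching test part, and the scores would then be averaged.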

55
Q

Explain what tuning hyperparameters is

A

The process of determining the optimal values of a model's hyperparameters, typically via cross-validation, so that the model predicts the outcome accurately

56
Q

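What are sensitivity and specificity?

A

Sensitivity is the true positive rate: the proportion of actual positives that are correctly identified

Specificity is the true negative rate: the proportion of actual negatives that are correctly identified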
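
Both rates can be computed from counts of true and false positives and negatives. A minimal Python sketch, assuming labels coded as 1 (positive) and 0 (negative) and using made-up data:

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Here: 3 true positives, 1 false negative, 3 true negatives, 1 false positive.
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(sensitivity, specificity)
```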

57
Q

How to evaluate a classification task in ML

A

Plot the ROC curve and calculate the area under the curve (AUC) to measure how well the classifier separates the classes

58
Q

What is the ROC curve?

A

A chart that illustrates the quality of a classification method by plotting its sensitivity (on the y-axis) vs the 1-specificity (x-axis) for a varying range of thresholds.
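
The construction described above can be sketched in plain Python: each threshold gives one (1 - specificity, sensitivity) point, and the area under the resulting curve is obtained with the trapezoidal rule. The function names, labels, and scores are hypothetical, and labels are assumed to be coded as 0 and 1.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points, strictest threshold first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)
print(auc)
```

For these six observations the AUC is 8/9, which matches the share of (positive, negative) pairs in which the positive case receives the higher score.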

59
Q

What SDLC model has no feature of revisiting previous versions?

A

Waterfall Model

60
Q

What are the features of an iterative model

A

Rigid, but insights are gained from earlier iterations

61
Q

What SDLC model should we use for a fast and flexible framework?

A

Agile model

62
Q

Explain the features of a V-shaped model

A

Each development phase has a corresponding testing phase, which is planned before that development phase is implemented

63
Q

Explain what the DevOps model is

A

Most recent model

Software developers and operations engineers work together from development through to interaction with customers

64
Q

Explain what a spiral SDLC model is

A

Combination of iterative and waterfall

Most flexible as it can adopt multiple models based on risk patterns

65
Q

State all types of software testing

A

Unit testing
Integration testing
Acceptance testing
System testing

66
Q

Can you write down unit tests before writing the code itself?

A

Yes
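
Writing the tests first is the idea behind test-driven development. A minimal Python sketch with the standard `unittest` module follows; the function `celsius_to_fahrenheit` is a hypothetical example, and the tests are written before the function they exercise.

```python
import unittest


# Step 1: write the tests that describe the expected behaviour.
class TestCelsiusToFahrenheit(unittest.TestCase):
    def test_freezing_point(self):
        self.assertEqual(celsius_to_fahrenheit(0), 32)

    def test_boiling_point(self):
        self.assertEqual(celsius_to_fahrenheit(100), 212)


# Step 2: write the code that makes the tests pass.
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


# Step 3: run the tests programmatically and inspect the result.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCelsiusToFahrenheit)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```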

67
Q

Should you write documentation in files separate from the main script?

A

Yes

69
Q

Advantages of writing code in the form of packages

A

Easy to share with other developers

Modular structure eases the debugging process

70
Q

Typical sequence of Git commands from creating a local repository to storing it in a remote repository

A
  1. git init
  2. git add
  3. git status
  4. git commit
  5. git remote
  6. git push
71
Q

What is the use of git checkout?

A

Switch to a different branch

72
Q

Explain what git diff does

A

Shows the differences between commits, or between a commit and the working tree

73
Q

What does git merge do?

A

Combine multiple sequences of commits into one unified history.

74
Q

What is the difference between git clone and git pull?

A

git clone: creates a local copy of the remote repository

git pull: fetches content from the remote repository and merges it into the local repository so that the two match