Interview PREP Flashcards

1
Q

What is the difference between supervised machine learning and unsupervised? Give examples.

A

Unsupervised machine learning is when you have a bunch of input values but no one particular output value, so basically you don’t exactly know what you’re looking for. It mainly deals with unlabeled data.

Supervised machine learning is when there is an associated response variable Yi and we try to find the relationship between predictors and the response variable. So linear regression,XGboost,etc. When I think of supervised machine learning, I think of inference and prediction

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is inference and prediction?

A

Prediction models are models that care about the error and don’t really care as much about how we got there. When I think of the epitome of predictive models, I think of neural networks.
Inferential models dive deeper. In inferential modeling, you really want to see how the individual predictors affect your prediction. You’re more curious about the complex relationships in your model in this methodology.
An overall example would be looking at housing prices. When you’re trying to be predictive as possible, you really just care about your accuracy. In an inferential methodology , you care about things like “how does square feet affect my price”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is regression? Which models can you solve with regression?

A

Regression is a part of supervised ML that investigates the relationships between dependent values and independent values. You have linear regression, polynomial regression, Ridge regression ,and Lasso Regression.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is linear regression? When do we use it?

A

Linear regression models assume linear relationships between your dependent value and independent values.
Simple linear regression would be something like y=b0+b1*x
Multiple linear regression is when you have multiple independent values, so itd be something like y=b0+b1x+b2x2,etc

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are the main assumptions of linear regression?

A

Linear relationship
Multivariate normality-This assumption can best be checked with a histogram or a Q-Q-Plot. Normality can be checked with a goodness of fit test, e.g., the Kolmogorov-Smirnov test. When the data is not normally distributed a non-linear transformation (e.g., log-transformation) might fix this issue-assumes the residuals are normality distributed
No or little multicollinearity-When two predictors are independent
No auto-correlation of errors-residuals should be independent of eachother
Homoscedasticity-the size of the error term shouldn’t depend on the independent value

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

what’s the normal distribution and why should we care about it?

A

The normal distribution is a continuous probability distribution where the mean mode and median are the same. We should care about it because it is very important to the central limit thereom which basically says that if you grab a large sample size, it should mirror a normal distribution. So if you look one std above the mean, you can assume that 16% of the population has a mean above that and then 2.5% 2 standard deviations away

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is gradient descent?

A

Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function. Gradient descent is simply used to find the values of a function’s parameters (coefficients) that minimize a cost function as far as possible. Imagine a blindfolded man who wants to climb to the top of a hill with the fewest steps along the way as possible. He might start climbing the hill by taking really big steps in the steepest direction, which he can do as long as he is not close to the top. As he comes closer to the top, however, his steps will get smaller and smaller to avoid overshooting it.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is batch gradient descent?

A

Batch gradient descent, also called vanilla gradient descent, calculates the error for each example within the training dataset, but only after all training examples have been evaluated does the model get updated. This whole process is like a cycle and it’s called a training epoch.
Some advantages of batch gradient descent are its computational efficient, it produces a stable error gradient and a stable convergence. Some disadvantages are the stable error gradient can sometimes result in a state of convergence that isn’t the best the model can achieve. It also requires the entire training dataset be in memory and available to the algorithm.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is stochastic gradient descent?

A

By contrast, stochastic gradient descent (SGD) does this for each training example within the dataset, meaning it updates the parameters for each training example one by one. Depending on the problem, this can make SGD faster than batch gradient descent. One advantage is the frequent updates allow us to have a pretty detailed rate of improvement.

The frequent updates, however, are more computationally expensive than the batch gradient descent approach. Additionally, the frequency of those updates can result in noisy gradients, which may cause the error rate to jump around instead of slowly decreasing.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is mini batch gradient descent?

A

Mini-batch gradient descent is the go-to method since it’s a combination of the concepts of SGD and batch gradient descent. It simply splits the training dataset into small batches and performs an update for each of those batches. This creates a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.

Common mini-batch sizes range between 50 and 256, but like any other machine learning technique, there is no clear rule because it varies for different applications. This is the go-to algorithm when training a neural network and it is the most common type of gradient descent within deep learning.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

which metrics do you know for evaluating linear regression?

A
Mean Squared Error(MSE)
Root Mean Squared Error(RMSE)
Mean Absolute Error(MAE)
R² or Coefficient of Determination
Adjusted R²
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is the bias-variance trade off?

A

Bias is the error introduced by approximating the true underlying function, which can be quite complex, by a simpler model(uder fitting)
. Variance is a model sensitivity to changes in the training dataset(overfitting)

Bias-variance trade-off is a relationship between the expected test error and the variance and the bias - both contribute to the level of the test error and ideally should be as small as possible:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is over fitting? What is underfitting?

A

But as a model complexity increases, the bias decreases and the variance increases which leads to overfitting. And vice versa, model simplification helps to decrease the variance but it increases the bias which leads to underfitting.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

How to validate your models?

A

One of the most common approaches is splitting data into train, validation and test parts. Models are trained on train data, hyperparameters (for example early stopping) are selected based on the validation data, the final measurement is done on test dataset. Another approach is cross-validation: split dataset into K folds and each time train models on training folds and measure the performance on the validation folds. Also, you could combine these approaches: make a test/holdout dataset and do cross-validation on the rest of the data. The final quality is measured on the test dataset.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Why do we need to split the data into train,validation,and test?

A

The training set is used to fit the model, i.e. to train the model with the data. The validation set is then used to provide an unbiased evaluation of a model while fine-tuning hyperparameters. This improves the generalization of the model. Finally, a test data set which the model has never “seen” before should be used for the final evaluation of the model. This allows for an unbiased evaluation of the model. The evaluation should never be performed on the same data that is used for training. Otherwise the model performance would not be representative.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

How do you go about adding and removing variables in your model?

A

Forward Selection-you start with just the intercept and keep adding variables and checking the rss
Backward Selection-we start with all the variables in the model and keep removing variables starting with the largest p value until a stopping rule is reached(p value goes above a certain threshold)
Mixed Selection-Start with no models, then go with the forward and then we remove if the p value goes above a certain threshold until all the variables in the model have a low p value and the ones with a big p value aren’t there

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What are some appraoches to validation?

A

Leave one out cross validation, k folds cross validation, and normal cross validation

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What is logistic regression and when is it used?

A

Logistic regression is a Machine Learning algorithm that is used for binary classification. You should use logistic regression when your Y variable takes only two values, e.g. True and False, “spam” and “not spam”, “churn” and “not churn” and so on. The variable is said to be a “binary” or “dichotomous”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What is a sigmmoid function? what does it do?

A

A sigmoid function is a type of activation function, and more specifically defined as a squashing function. Squashing functions limit the output to a range between 0 and 1, making these functions useful in the prediction of probabilities.

Sigmod(x) = 1/(1+e^{-x})

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Is accuracy always a good metric?

A

Accuracy is not a good performance metric when there is imbalance in the dataset. For example, in binary classification with 95% of A class and 5% of B class, a constant prediction of A class would have an accuracy of 95%. In case of imbalance dataset, we need to choose Precision, recall, or F1 Score depending on the problem we are trying to solve.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What is regularization and why do we need it?

A

Regularization is used to reduce overfitting in machine learning models. It helps the models to generalize well and make them robust to outliers and noise in the data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Which regularization techniques do you know?

A
L1 Regularization (Lasso regularization) - Adds the sum of absolute values of the coefficients to the cost function- can shrink to 0 
L2 Regularization (Ridge regularization) - Adds the sum of squares of coefficients to the cost function-can shrink to close to 0 but never to 0

Where lambda determines the amount of regularization

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

How does L2 regularization look like in a linear model?

A

L2 regularization adds a penalty term to our cost function which is equal to the sum of squares of models coefficients multiplied by a lambda hyperparameter. This technique makes sure that the coefficients are close to zero and is widely used in cases when we have a lot of features that might correlate with each other.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

How do we select the right regularizaiton parameters?

A

Regularization parameters can be chosen using a grid search, for example https://scikit-learn.org/stable/modules/linear_model.html has one formula for the implementing for regularization, alpha in the formula mentioned can be found by doing a RandomSearch or a GridSearch on a set of values and selecting the alpha which gives the least cross validation or validation error.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

Whats the different between l2 regularisation(ridge) and l1 (lasso)

A

Penalty terms: L1 regularization uses the sum of the absolute values of the weights, while L2 regularization uses the sum of the weights squared.
Feature selection: L1 performs feature selection by reducing the coefficients of some predictors to 0, while L2 does not.
Computational efficiency: L2 has an analytical solution, while L1 does not.
Multicollinearity: L2 addresses multicollinearity by constraining the coefficient norm.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

What are decision trees?

A

This is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables.

In this algorithm, we split the population into two or more homogeneous sets. This is done based on most significant attributes/ independent variables to make as distinct groups as possible.

A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a value for the target variabl

to quote from the elements of statistical learning”trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

What are main parameters of a decision tree?

A

maximum tree depth
minimum samples per leaf node
impurity criterion

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What is random forest?

A

Random Forest is a machine learning method for regression and classification which is composed of many decision trees. Random Forest belongs to a larger class of ML algorithms called ensemble methods (in other words, it involves the combination of several models to solve a single prediction problem

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

Explain how RF works

A

Step 1) create a bootstrapped dataset->to create a bootstrapped sample, we just randomly select samples from the original dataset->we can pick the same sample more than once
Step 2) create a decision tree using the bootstrapped dataset but only use a random subset of variables at each step->
step3) let’s say we select two variables->and we get good blood circulation as the best predictor-> we then make that a split and keep repeating
step4) go back to step 1 and repeat-> do this a lot of times
How do we use?
step5) run the data down the first tree->let’s say it predicts heart disease
Step 6) run the data down the second tree->lets say it also say yes-> keep going
Step 7) we see which option received more votes
bootstrapping the aggregate and using the aggregate is called bagging
Step 8)typically some of the data doesn’t end up in the dataset->we test our trees with that ->out of bag dataset
Step 9)accurate = proportion of out of bag that were correctly classified
Step 10) can do things like change the number of variables per step and choose the one which performs best

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

Gradient boosting tree?

A

Gradient boosting works with first guess-initial guess-then builds a tree-builds fixed tree sizes based off previous errors-similiar to adaboost-it scales the trees -however it scales all trees by the same amount-builds another tree built off the tree before-keeps building until it has made the number of trees uve asked for or fit doesn’t get better-we are basically predicting the residuals-low bias high variance if we over fit -uses a learning rate to fight it-taking small steps results in better predictions on testing dataset-start with initial prediction then add first tree prediction and then add the second set prediction -each time we add a tree to the prediction-

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

Explain XGboost tree

A

Designed to be used with large complicated datasets
Start out with prediction
Xgboost fits regression tree to the residuals like gradient descent-uses a unique regression tree-
Each tree starts out as a single leaf-all the residuals go to the leaf-calculate similarity score-sum of residuals squared divided by number of residuals + a lambda (regularization parameters)
Can we do a better job clustering the residuals?
Then calculate another similarity score for those residuals
When residuals are similar or just one- the similarity scores are large
We need to compare these new leafs with the older tree which we compare the gains - similarity score of leaf on left plus similarity score of leaf on right - leaf score

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

What hyper parameter tuning do you know?

A

rid Search is an exhaustive approach such that for each hyper-parameter, the user needs to manually give a list of values for the algorithm to try. After these values are selected, grid search then evaluates the algorithm using each and every combination of hyper-parameters and returns the combination that gives the optimal result (i.e. lowest MAE). Because grid search evaluates the given algorithm using all combinations, it’s easy to see that this can be quite computationally expensive and can lead to sub-optimal results specifically since the user needs to specify specific values for these hyper-parameters, which is prone for error and requires domain knowledge.

Random Search is similar to grid search but differs in the sense that rather than specifying which values to try for each hyper-parameter, an upper and lower bound of values for each hyper-parameter is given instead. With uniform probability, random values within these bounds are then chosen and similarly, the best combination is returned to the user. Although this seems less intuitive, no domain knowledge is necessary and theoretically much more of the parameter space can be explored.
Bayesian processes-not exactly ure how ti works though

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

What are the problems with sigmoid as an activation function?

A

The derivative of the sigmoid function for large positive or negative numbers is almost zero. From this comes the problem of vanishing gradient — during the backpropagation our net will not learn (or will learn drastically slow). One possible way to solve this problem is to use ReLU activation function.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

What is RELU? how is it better than sigmoid?

A

ReLU is an abbreviation for Rectified Linear Unit. It is an activation function which has the value 0 for all negative values and the value f(x) = x for all positive values. The ReLU has a simple activation function which makes it fast to compute and while the sigmoid and tanh activation functions saturate at higher values, the ReLU has a potentially infinite activation, which addresses the problem of vanishing gradients.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
35
Q

reg techniques for NN?

A

L1 Regularization - Defined as the sum of absolute values of the individual parameters. The L1 penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.
L2 Regularization - Defined as the sum of square of individual parameters. Often supported by regularization hyperparameter alpha. It results in weight decay.
Data Augmentation - This requires some fake data to be created as a part of training set.
Drop Out : This is most effective regularization technique for newral nets. Few randome nodes in each layer is deactivated in forward pass. This allows the algorithm to train on different set of nodes in each iterations.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
36
Q

What is the difference between Data Mining and Data Profiling?

A

Process of finding relevant information which has not been found before- way in which raw data is turned into valuable information-anything like web scraping/census data
Data profiling is usually done to assess a dataset for its uniqueness ,consistency and logic. Looking at it and saying “is it related to what im working with”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
37
Q

Define data wrangling in terms of data analytics

A

Data wrangling is the process of cleaning,structuring and enriching the raw data into a desirable usable format for better decision making

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

What are the various steps involved in any analytics project?

A
Understand the problem
Data collection
Data cleaning
Data exploration and analysis
Interpret the results
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
39
Q

What are the best practices for data cleaning?

A

80% in most analysis is in the cleaning
Make a data cleaning plan by understanding where common errors take place and keep communication
Identify and remove duplicates before working with the data
Focus on the accuracy on the data, maintain the value of types of data
Standardize the data at point of entry

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
40
Q

How do you subset or filter data in SQL

A

Where and HAVING clause

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
41
Q

What is the difference between where and having clause in sql

A

Where works on row data
The filter occurs before any groupings are made
Aggregate functions cant be used
Having works on aggregated
Having is used to filter values from a group
Aggregate functions can be used

42
Q

Write the python code to create an employees data from the emp.csv file and display the head and summary of ist

A

employees=pd.read_csv(“wherever it is/emp.csv”

employees. head()
employees. summary()

43
Q

How will you select the department and age columns from an employees dataframe

A

employees=pd.DataFrame(data)

Employees[[‘department’,’age’]]

44
Q

What is the criteria to say whether a developed data model is good or not?

A

Intuitive ,insightful and self explanatory
Should be easily confusmed by the client for actionable and profitable
Good model should easily to adapt to changes according to business requirements
If the dat gets updated, the model should be able to scale to the new data

45
Q

What are the different types of sampling techniques used by data analysts?

A

Used to select a subset of data from an entire data set to estimate the characteristics of the whole population
Random sampling
Systematic sampling-1/5/10/20
Cluster sampling-some of these things naturally group together (maybe by zip code)
Stratified sampling-(looking for shared things the group has like income)

46
Q

How are Union,intersect , and except used in sql

A

Takes both sql tables and takes it into one table
Intersect extracts common records between two tables
Except ->uncommon records ->two records that are not shared between the two

47
Q

Using the product_price table, write a sql query to find the record with the fourth market price

A

Select * from product price
Select top 1* from
(select top 4 from product_price order by mkt_price desc) as sp order by mkt mkt_price asc)

48
Q

Suppose there is an array num=np.array([[1,2,3],[4,5,6],[7,8,9]-extract the value 8 using 2d indexing

A

Num[2,1]

49
Q

how to find unique in pandas?

A

To find unique values in panda use .unique()
.nunique(number of unique values)
by_comb.describe()
by_comp.describe().transpose()

50
Q

What is DBMS?

A

A Database Management System (DBMS) is a program that controls creation, maintenance and use of a database. DBMS can be termed as File Manager that manages data in a database rather than saving it in file systems.

51
Q

What is RDBMS?

A

RDBMS stands for Relational Database Management System. RDBMS store the data into the collection of tables, which is related by common fields between the columns of the table. It also provides relational operators to manipulate the data stored into the tables.

52
Q

What is SQL?

A

SQL stands for Structured Query Language , and it is used to communicate with the Database. This is a standard language used to perform tasks such as retrieval, updation, insertion and deletion of data from a database.

53
Q

What are tables and Fields?

A

A table is a set of data that are organized in a model with Columns and Rows. Columns can be categorized as vertical, and Rows are horizontal. A table has specified number of column called fields but can have any number of rows which is called record.

Example:.

Table: Employee.

Field: Emp ID, Emp Name, Date of Birth.

Data: 201456, David, 11/15/1960.

54
Q

What is a primary key?

A

A primary key is a combination of fields which uniquely specify a row. This is a special kind of unique key, and it has implicit NOT NULL constraint. It means, Primary key values cannot be NULL.

55
Q

What is a unique key?

A

A Unique key constraint uniquely identified each record in the database. This provides uniqueness for the column or set of columns.Makes sure there aren’t repeats

A Primary key constraint has automatic unique constraint defined on it. But not, in the case of Unique Key.

There can be many unique constraint defined per table, but only one Primary key constraint defined per table.

56
Q

What is a foreign key?

A

A foreign key is one table which can be related to the primary key of another table. Relationship needs to be created between two tables by referencing foreign key with the primary key of another table.

57
Q

What is a join?

A

This is a keyword used to query data from more tables based on the relationship between the fields of the tables. Keys play a major role when JOINs are used.

58
Q

what is inner join?

A

Inner join return rows when there is at least one match of rows between the tables.

59
Q

what is right join?

A

Right join return rows which are common between the tables and all rows of Right hand side table. Simply, it returns all the rows from the right hand side table even though there are no matches in the left hand side table.

60
Q

what is left join?

A

Left join return rows which are common between the tables and all rows of Left hand side table. Simply, it returns all the rows from Left hand side table even though there are no matches in the Right hand side table.

61
Q

What is full join?

A

Full join return rows when there are matching rows in any one of the tables. This means, it returns all the rows from the left hand side table and all the rows from the right hand side table.

62
Q

What is normalization?

A

Normalization is the process of minimizing redundancy and dependency by organizing fields and table of a database. The main aim of Normalization is to add, delete or modify field that can be made in a single table.

63
Q

What is an index?

A

An index is performance tuning method of allowing faster retrieval of records from the table. An index creates an entry for each value and it will be faster to retrieve data.

64
Q

What are the three types of index?

A

unique,clustered,nonclustered

65
Q

Unique Index.

A

This indexing does not allow the field to have duplicate values if the column is unique indexed. Unique index can be applied automatically when primary key is defined.

66
Q

Clustered index?

A

This type of index reorders the physical order of the table and search based on the key values. Each table can have only one clustered index.

67
Q

Nonclustered indeX?

A

NonClustered Index does not alter the physical order of the table and maintains logical order of data. Each table can have 999 nonclustered indexes.

68
Q

What is a Cursor?

A

A database Cursor is a control which enables traversal over the rows or records in the table. This can be viewed as a pointer to one row in a set of rows. Cursor is very much useful for traversing such as retrieval, addition and removal of database records.

69
Q

What is a relationship and what are they?

A

Database Relationship is defined as the connection between the tables in a database. There are various data basing relationships, and they are as follows:.

One to One Relationship.
One to Many Relationship.
Many to One Relationship.
Self-Referencing Relationship.

70
Q

What is a query?

A

A DB query is a code written in order to get the information back from the database. Query can be designed in such a way that it matched with our expectation of the result set. Simply, a question to the Database.

71
Q

What is subquery?

A

A subquery is a query within another query. The outer query is called as main query, and inner query is called subquery. SubQuery is always executed first, and the result of subquery is passed on to the main query.

72
Q

What are the types of subquery?

A

There are two types of subquery – Correlated and Non-Correlated.

A correlated subquery cannot be considered as independent query, but it can refer the column in a table listed in the FROM the list of the main query.

A Non-Correlated sub query can be considered as independent query and the output of subquery are substituted in the main query.

73
Q

What is a stored procedure?

A

Stored Procedure is a function consists of many SQL statement to access the database system. Several SQL statements are consolidated into a stored procedure and execute them whenever and wherever required.

74
Q

What is a trigger?

A

A DB trigger is a code or programs that automatically execute with response to some event on a table or view in a database. Mainly, trigger helps to maintain the integrity of the database.

Example: When a new student is added to the student database, new records should be created in the related tables like Exam, Score and Attendance tables.

75
Q

What is the difference between DELETE and TRUNCATE commands?

A

DELETE command is used to remove rows from the table, and WHERE clause can be used for conditional set of parameters. Commit and Rollback can be performed after delete statement.

TRUNCATE removes all rows from the table. Truncate operation cannot be rolled back.

76
Q

What are local and global variables and their differences?

A

Local variables are the variables which can be used or exist inside the function. They are not known to the other functions and those variables cannot be referred or used. Variables can be created whenever that function is called.

Global variables are the variables which can be used or exist throughout the program. Same variable declared in global cannot be used in functions. Global variables cannot be created whenever that function is called.

77
Q

What is a constraint?

A

Constraint can be used to specify the limit on the data type of table. Constraint can be specified while creating or altering the table statement. Sample of constraint are.

NOT NULL.
CHECK.
DEFAULT.
UNIQUE.
PRIMARY KEY.
FOREIGN KEY.
78
Q

What is data Integrity?

A

Data Integrity defines the accuracy and consistency of data stored in a database. It can also define integrity constraints to enforce business rules on the data when it is entered into the application or database.

79
Q

What is Auto Increment?

A

Auto increment keyword allows the user to create a unique number to be generated when a new record is inserted into the table. AUTO INCREMENT keyword can be used in Oracle and IDENTITY keyword can be used in SQL SERVER.

Mostly this keyword can be used whenever PRIMARY KEY is used.

80
Q

What is the difference between Cluster and Non-Cluster Index?

A

Clustered index is used for easy retrieval of data from the database by altering the way that the records are stored. Database sorts out rows by the column which is set to be clustered index.

A nonclustered index does not alter the way it was stored but creates a complete separate object within the table. It point back to the original table rows after searching.

81
Q

What is Datawarehouse?

A

Datawarehouse is a central repository of data from multiple sources of information. Those data are consolidated, transformed and made available for the mining and online processing. Warehouse data have a subset of data called Data Marts.

82
Q

What is Self-Join?

A

Self-join is set to be query used to compare to itself. This is used to compare values in a column with other values in the same column in the same table. ALIAS ES can be used for the same table comparison.

83
Q

What is Cross-Join?

A

Cross join defines as Cartesian product where number of rows in the first table multiplied by number of rows in the second table. If suppose, WHERE clause is used in cross join then the query will work like an INNER JOIN.

84
Q

What is user defined functions?

A

User defined functions are the functions written to use that logic whenever required. It is not necessary to write the same logic several times. Instead, function can be called or executed whenever needed.

85
Q

What is collation?

A

Collation is defined as set of rules that determine how character data can be sorted and compared. This can be used to compare A and, other language characters and also depends on the width of the characters.

ASCII value can be used to compare these character data.

86
Q

What are all different types of collation sensitivity?

A

Case Sensitivity – A and a and B and b.
Accent Sensitivity.
Kana Sensitivity – Japanese Kana characters.
Width Sensitivity – Single byte character and double byte character.

87
Q

Advantages and Disadvantages of Stored Procedure?

A

Stored procedure can be used as a modular programming – means create once, store and call for several times whenever required. This supports faster execution instead of executing multiple queries. This reduces network traffic and provides better security to the data.

Disadvantage is that it can be executed only in the Database and utilizes more memory in the database server

88
Q

What is CLAUSE?

A

SQL clause is defined to limit the result set by providing condition to the query. This usually filters some rows from the whole set of records.

Example – Query that has WHERE condition

Query that has HAVING condition.

89
Q

What is Union, minus and Interact commands?

A

UNION operator is used to combine the results of two tables, and it eliminates duplicate rows from the tables.

MINUS operator is used to return rows from the first query but not from the second query. Matching records of first and second query and other rows from the first query will be displayed as a result set.

INTERSECT operator is used to return rows returned by both the queries.

90
Q

What is an ALIAS command?

A

ALIAS name can be given to a table or column. This alias name can be referred in WHERE clause to identify the table or column.

91
Q

What is the difference between TRUNCATE and DROP statements?

A

TRUNCATE removes all the rows from the table, and it cannot be rolled back. DROP command removes a table from the database and operation cannot be rolled back.

92
Q

What are aggregate and scalar functions?

A

Aggregate functions are used to evaluate mathematical calculation and return single values. This can be calculated from the columns in a table. Scalar functions return a single value based on the input value.

Example -.

Aggregate – max(), count - Calculated with respect to numeric.

Scalar – UCASE(), NOW() – Calculated with respect to strings.

93
Q

How can you create an empty table from an existing table?

A

Select * into studentcopy from student where 1=2

94
Q

How to fetch common records from two tables?

A

Select studentID from student INTERSECT Select StudentID from Exam

95
Q

How to select unique records from a table?

A

Select DISTINCT StudentID, StudentName from Student.

96
Q

What is the command used to fetch first 5 characters of the string?

A

Select SUBSTRING(StudentName,1,5) as studentname from student

97
Q

Which operator is used in query for pattern matching?

A

% - Matches zero or more characters.

_(Underscore) – Matching exactly one character.

98
Q

list different types of relationships in sql

A

One-to-One - This can be defined as the relationship between two tables where each record in one table is associated with the maximum of one record in the other table.
One-to-Many & Many-to-One - This is the most commonly used relationship where a record in a table is associated with multiple records in the other table.
Many-to-Many - This is used in cases when multiple instances on both sides are needed for defining a relationship.
Self Referencing Relationships - This is used when a table needs to define a relationship with itself.

99
Q

What are Aggregate and Scalar functions?

A

An aggregate function performs operations on a collection of values to return a single scalar value. Aggregate functions are often used with the GROUP BY and HAVING clauses of the SELECT statement. Following are the widely used SQL aggregate functions:

AVG() - Calculates the mean of a collection of values.
COUNT() - Counts the total number of records in a specific table or view.
MIN() - Calculates the minimum of a collection of values.
MAX() - Calculates the maximum of a collection of values.
SUM() - Calculates the sum of a collection of values.
FIRST() - Fetches the first element in a collection of values.
LAST() - Fetches the last element in a collection of values.
Note: All aggregate functions described above ignore NULL values except for the COUNT function.
A scalar function returns a single value based on the input value. Following are the widely used SQL scalar functions:

LEN() - Calculates the total length of the given field (column).
UCASE() - Converts a collection of string values to uppercase characters.
LCASE() - Converts a collection of string values to lowercase characters.
MID() - Extracts substrings from a collection of string values in a table.
CONCAT() - Concatenates two or more strings.
RAND() - Generates a random collection of numbers of given length.
ROUND() - Calculates the round off integer value for a numeric field (or decimal point values).
NOW() - Returns the current data & time.
FORMAT() - Sets the format to display a collection of values.

100
Q

What is the purpose of the group functions in SQL? Give some examples of group functions.

A

Group functions are necessary to get summary statistics of a data set. COUNT, MAX, MIN, AVG, SUM, and DISTINCT are all group functions.

101
Q

What does UNION do? What is the difference between UNION and UNION ALL?

A

“UNION removes duplicate records (where all columns in the results are the same), UNION ALL does not.” Read more here.