Interview Prep Flashcards
What is the difference between supervised machine learning and unsupervised? Give examples.
Unsupervised machine learning is when you have a set of input values but no associated output value, so you don't know in advance exactly what you're looking for; it deals with unlabeled data. Examples include clustering (e.g., k-means) and dimensionality reduction (e.g., PCA).
Supervised machine learning is when there is an associated response variable Yi and we try to find the relationship between the predictors and the response. Examples include linear regression, XGBoost, etc. When I think of supervised machine learning, I think of inference and prediction.
What is inference and prediction?
Predictive models care mainly about minimizing the prediction error and not so much about how we got there. When I think of the epitome of predictive models, I think of neural networks.
Inferential models dive deeper. In inferential modeling, you really want to see how the individual predictors affect the response; you're more interested in understanding the relationships in your model.
An overall example is modeling housing prices. When you're trying to be as predictive as possible, you mainly care about accuracy. With an inferential approach, you care about questions like "how does square footage affect the price?"
What is regression? Which models can you solve with regression?
Regression is a part of supervised ML that investigates the relationship between a dependent variable and one or more independent variables. Examples include linear regression, polynomial regression, Ridge regression, and Lasso regression.
What is linear regression? When do we use it?
Linear regression models assume a linear relationship between the dependent variable and the independent variables.
Simple linear regression has one predictor: y = b0 + b1*x
Multiple linear regression has several predictors: y = b0 + b1*x1 + b2*x2 + ...
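A minimal sketch of fitting both forms with scikit-learn; the library choice and the toy data are my own illustration, not from the flashcards:

```python
# Sketch only: simple vs. multiple linear regression with scikit-learn.
# The toy data and coefficients below are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # two independent variables
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

simple = LinearRegression().fit(X[:, [0]], y)    # y = b0 + b1*x
multiple = LinearRegression().fit(X, y)          # y = b0 + b1*x1 + b2*x2

print(simple.intercept_, simple.coef_)           # estimates of b0, b1
print(multiple.intercept_, multiple.coef_)       # estimates of b0, b1, b2
```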
What are the main assumptions of linear regression?
Linear relationship
Multivariate normality - the residuals are assumed to be normally distributed. This can be checked visually with a histogram or a Q-Q plot, or with a goodness-of-fit test such as the Kolmogorov-Smirnov test. If the data is not normally distributed, a non-linear transformation (e.g., a log transformation) might fix the issue.
No or little multicollinearity - the predictors should not be highly correlated with each other.
No autocorrelation of errors - the residuals should be independent of each other.
Homoscedasticity - the variance of the error term should not depend on the values of the independent variables. (A code sketch of these checks follows below.)
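A hedged sketch of how these assumptions can be checked with statsmodels; the toy data and the specific checks chosen (Q-Q plot, Durbin-Watson, VIF, residuals-vs-fitted) are my own illustration:

```python
# Sketch only: common ways to check the assumptions above using statsmodels.
# The toy data is made up; real checks should use your actual design matrix.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()

# Normality of residuals: visual check with a Q-Q plot
sm.qqplot(model.resid, line="45")

# Autocorrelation of errors: Durbin-Watson statistic (values near 2 suggest independence)
print("Durbin-Watson:", durbin_watson(model.resid))

# Multicollinearity: variance inflation factor per predictor (VIF above ~5-10 is a red flag)
for i in range(1, X_const.shape[1]):             # skip the constant column
    print("VIF:", variance_inflation_factor(X_const, i))

# Homoscedasticity: plot model.fittedvalues against model.resid and look for a funnel shape
```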
what’s the normal distribution and why should we care about it?
The normal distribution is a continuous probability distribution where the mean mode and median are the same. We should care about it because it is very important to the central limit thereom which basically says that if you grab a large sample size, it should mirror a normal distribution. So if you look one std above the mean, you can assume that 16% of the population has a mean above that and then 2.5% 2 standard deviations away
What is gradient descent?
Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. It is used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible. Imagine a blindfolded man who wants to reach the bottom of a valley with as few steps as possible. He might start by taking really big steps in the steepest downhill direction, which he can do as long as he is not close to the bottom. As he comes closer to the bottom, however, his steps will get smaller and smaller to avoid overshooting it.
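A minimal numpy sketch of gradient descent minimizing the MSE of a one-feature linear model; the learning rate and iteration count are arbitrary illustration values:

```python
# Sketch only: gradient descent minimizing MSE for y = b0 + b1*x.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 4.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

b0, b1 = 0.0, 0.0                      # start from an arbitrary point
lr = 0.1                               # learning rate (step size)
for _ in range(500):
    error = (b0 + b1 * x) - y
    grad_b0 = 2 * error.mean()         # gradient of the MSE cost w.r.t. b0
    grad_b1 = 2 * (error * x).mean()   # gradient of the MSE cost w.r.t. b1
    b0 -= lr * grad_b0                 # step in the direction of steepest descent
    b1 -= lr * grad_b1

print(b0, b1)                          # should end up close to 4.0 and 3.0
```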
What is batch gradient descent?
Batch gradient descent, also called vanilla gradient descent, calculates the error for each example within the training dataset, but only after all training examples have been evaluated does the model get updated. This whole process is like a cycle and it’s called a training epoch.
Some advantages of batch gradient descent are that it is computationally efficient and produces a stable error gradient and stable convergence. Some disadvantages are that the stable error gradient can sometimes converge to a state that isn't the best the model can achieve, and that it requires the entire training dataset to be in memory and available to the algorithm.
What is stochastic gradient descent?
By contrast, stochastic gradient descent (SGD) calculates the error and updates the parameters for each training example in the dataset, one by one. Depending on the problem, this can make SGD faster than batch gradient descent. One advantage is that the frequent updates give us a fairly detailed picture of the rate of improvement.
The frequent updates, however, are more computationally expensive than the batch gradient descent approach. Additionally, the frequency of those updates can result in noisy gradients, which may cause the error rate to jump around instead of slowly decreasing.
What is mini batch gradient descent?
Mini-batch gradient descent is the go-to method since it’s a combination of the concepts of SGD and batch gradient descent. It simply splits the training dataset into small batches and performs an update for each of those batches. This creates a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.
Common mini-batch sizes range between 50 and 256, but like any other machine learning technique, there is no clear rule because it varies for different applications. This is the go-to algorithm when training a neural network and it is the most common type of gradient descent within deep learning.
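A hedged numpy sketch of mini-batch gradient descent on the same kind of toy linear model; the batch size of 64 and the 20 epochs are arbitrary choices:

```python
# Sketch only: mini-batch gradient descent on a toy linear model.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = 4.0 + 3.0 * x + rng.normal(scale=0.5, size=1000)

b0, b1, lr, batch_size = 0.0, 0.0, 0.1, 64
for epoch in range(20):
    order = rng.permutation(len(x))                  # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (b0 + b1 * x[idx]) - y[idx]
        b0 -= lr * 2 * error.mean()                  # one update per mini-batch
        b1 -= lr * 2 * (error * x[idx]).mean()

print(b0, b1)                                        # should approach 4.0 and 3.0
```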
Which metrics do you know for evaluating linear regression?
Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R² (the coefficient of determination), and adjusted R².
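A minimal sketch computing these metrics with scikit-learn on made-up predictions; adjusted R² is computed by hand since scikit-learn has no built-in helper for it:

```python
# Sketch only: the listed metrics computed on made-up predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 12.6])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

n, p = len(y_true), 2                                # p = assumed number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)        # adjusted R², computed by hand

print(mse, rmse, mae, r2, adj_r2)
```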
What is the bias-variance trade off?
Bias is the error introduced by approximating the true underlying function, which can be quite complex, with a simpler model (underfitting).
Variance is a model's sensitivity to changes in the training dataset (overfitting).
The bias-variance trade-off is the relationship between the expected test error and the variance and the bias - both contribute to the level of the test error and ideally should be as small as possible: expected test error = bias² + variance + irreducible error.
What is over fitting? What is underfitting?
As model complexity increases, the bias decreases and the variance increases, which leads to overfitting: the model starts fitting noise in the training data and performs poorly on new data. Conversely, simplifying the model decreases the variance but increases the bias, which leads to underfitting: the model is too simple to capture the underlying pattern.
How to validate your models?
One of the most common approaches is splitting the data into train, validation, and test parts. Models are trained on the training data, hyperparameters (for example, early stopping) are selected based on the validation data, and the final measurement is done on the test dataset. Another approach is cross-validation: split the dataset into K folds, and each time train the model on the training folds and measure the performance on the validation fold. You can also combine these approaches: set aside a test/holdout dataset and do cross-validation on the rest of the data; the final quality is then measured on the test dataset.
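A hedged sketch of both approaches with scikit-learn; the toy data, split sizes, and Ridge model are arbitrary illustration choices:

```python
# Sketch only: train/validation/test split plus k-fold CV on the non-test portion.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=500)

# Hold out a test set first, then carve a validation set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("validation R^2:", model.score(X_val, y_val))        # used for tuning decisions

# Alternative: k-fold cross-validation on everything except the holdout test set
print("5-fold CV R^2:", cross_val_score(Ridge(alpha=1.0), X_train, y_train, cv=5).mean())

# Final, unbiased measurement on the untouched test set
print("test R^2:", Ridge(alpha=1.0).fit(X_train, y_train).score(X_test, y_test))
```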
Why do we need to split the data into train, validation, and test?
The training set is used to fit the model, i.e. to train the model with the data. The validation set is then used to provide an unbiased evaluation of a model while fine-tuning hyperparameters. This improves the generalization of the model. Finally, a test data set which the model has never “seen” before should be used for the final evaluation of the model. This allows for an unbiased evaluation of the model. The evaluation should never be performed on the same data that is used for training. Otherwise the model performance would not be representative.
How do you go about adding and removing variables in your model?
Forward selection - start with just the intercept and keep adding variables, checking the RSS (residual sum of squares) at each step.
Backward selection - start with all the variables in the model and keep removing the variable with the largest p-value until a stopping rule is reached (e.g., every remaining p-value is below a certain threshold).
Mixed selection - start with no variables, add variables as in forward selection, and remove any variable whose p-value rises above a certain threshold, until every variable in the model has a low p-value and any variable outside the model would have a large p-value if added.
What are some approaches to validation?
Leave-one-out cross-validation (LOOCV), k-fold cross-validation, and the validation set (holdout) approach.
What is logistic regression and when is it used?
Logistic regression is a machine learning algorithm that is used for binary classification. You should use logistic regression when your Y variable takes only two values, e.g., true and false, "spam" and "not spam", "churn" and "not churn", and so on. Such a variable is said to be binary or dichotomous.
What is a sigmoid function? What does it do?
A sigmoid function is a type of activation function, and more specifically defined as a squashing function. Squashing functions limit the output to a range between 0 and 1, making these functions useful in the prediction of probabilities.
sigmoid(x) = 1 / (1 + e^(-x))
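A minimal numpy sketch of the sigmoid squashing raw linear scores into probabilities; the example scores are made up:

```python
# Sketch only: the sigmoid squashing raw linear scores (logits) into (0, 1).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])     # made-up raw scores b0 + b1*x1 + ...
print(sigmoid(z))                             # ~[0.007, 0.269, 0.5, 0.731, 0.993]
# In logistic regression, sigmoid(b0 + b1*x1 + ...) is read as P(y = 1 | x).
```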
Is accuracy always a good metric?
Accuracy is not a good performance metric when there is class imbalance in the dataset. For example, in binary classification with 95% of class A and 5% of class B, a constant prediction of class A would have an accuracy of 95%. With an imbalanced dataset, we need to choose precision, recall, or the F1 score depending on the problem we are trying to solve.
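A hedged sketch reproducing the 95%/5% example with scikit-learn metrics; the constant prediction scores 95% accuracy but zero precision, recall, and F1 on the rare class:

```python
# Sketch only: the 95%/5% imbalance example from above.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)         # 95% class A (0), 5% class B (1)
y_pred = np.zeros(100, dtype=int)             # constant prediction of class A

print("accuracy :", accuracy_score(y_true, y_pred))                        # 0.95
print("precision:", precision_score(y_true, y_pred, zero_division=0))      # 0.0
print("recall   :", recall_score(y_true, y_pred, zero_division=0))         # 0.0
print("F1       :", f1_score(y_true, y_pred, zero_division=0))             # 0.0
```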
What is regularization and why do we need it?
Regularization is used to reduce overfitting in machine learning models. It helps the models to generalize well and make them robust to outliers and noise in the data.
Which regularization techniques do you know?
L1 Regularization (Lasso regularization) - adds the sum of the absolute values of the coefficients to the cost function; it can shrink coefficients exactly to 0.
L2 Regularization (Ridge regularization) - adds the sum of the squares of the coefficients to the cost function; it can shrink coefficients close to 0 but never exactly to 0.
In both cases, lambda determines the amount of regularization.
What does L2 regularization look like in a linear model?
L2 regularization adds a penalty term to the cost function equal to the sum of squares of the model's coefficients multiplied by a lambda hyperparameter. This technique keeps the coefficients close to zero and is widely used when we have a lot of features that might correlate with each other.
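A hedged scikit-learn sketch of this shrinkage effect; in scikit-learn's Ridge the lambda hyperparameter is called alpha, and the correlated toy features are my own illustration:

```python
# Sketch only: L2 shrinkage on two strongly correlated toy features.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)            # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=200)

for alpha in [0.01, 1.0, 100.0]:                     # alpha plays the role of lambda
    print(alpha, Ridge(alpha=alpha).fit(X, y).coef_)
# Larger alpha pulls the correlated coefficients toward zero (but not exactly to zero).
```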
How do we select the right regularization parameters?
Regularization parameters can be chosen with a grid search. For example, the linear models in scikit-learn (https://scikit-learn.org/stable/modules/linear_model.html) expose the regularization strength as alpha in their formulas; alpha can be found by running a random search or a grid search over a set of values and selecting the value that gives the lowest cross-validation or validation error.
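A minimal sketch of that grid search over alpha with scikit-learn's GridSearchCV; the toy data and candidate grid are arbitrary:

```python
# Sketch only: picking alpha (the regularization strength) with GridSearchCV.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + rng.normal(size=300)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": np.logspace(-3, 3, 13)},    # candidate lambdas
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)                           # alpha with the lowest CV error
```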
What's the difference between L2 regularization (Ridge) and L1 (Lasso)?
Penalty terms: L1 regularization uses the sum of the absolute values of the weights, while L2 regularization uses the sum of the weights squared.
Feature selection: L1 performs feature selection by reducing the coefficients of some predictors to 0, while L2 does not.
Computational efficiency: L2 has an analytical solution, while L1 does not.
Multicollinearity: L2 addresses multicollinearity by constraining the coefficient norm.
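A hedged sketch of the feature-selection difference on toy data where only two of six features matter; the alpha values are arbitrary:

```python
# Sketch only: Lasso zeroing out irrelevant coefficients vs. Ridge shrinking them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)    # irrelevant coefficients become exactly 0
print("Ridge:", Ridge(alpha=10.0).fit(X, y).coef_)   # irrelevant coefficients stay small but non-zero
```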
What are decision trees?
This is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables.
In this algorithm, we split the population into two or more homogeneous sets. This is done based on most significant attributes/ independent variables to make as distinct groups as possible.
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a value for the target variable.
To quote from The Elements of Statistical Learning: "Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy."
What are main parameters of a decision tree?
maximum tree depth
minimum samples per leaf node
impurity criterion
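A minimal sketch wiring these three parameters into scikit-learn's DecisionTreeClassifier; the dataset and parameter values are arbitrary illustration choices:

```python
# Sketch only: the three parameters above wired into a scikit-learn decision tree.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=4,              # maximum tree depth
    min_samples_leaf=10,      # minimum samples per leaf node
    criterion="gini",         # impurity criterion ("gini" or "entropy")
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```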
What is random forest?
Random Forest is a machine learning method for regression and classification which is composed of many decision trees. Random Forest belongs to a larger class of ML algorithms called ensemble methods (in other words, it involves combining several models to solve a single prediction problem).
Explain how RF works
Step 1) Create a bootstrapped dataset -> randomly select samples from the original dataset -> the same sample can be picked more than once.
Step 2) Build a decision tree from the bootstrapped dataset, but use only a random subset of variables at each split.
Step 3) For example, if we consider two variables and good blood circulation turns out to be the best predictor, we make that the split and keep repeating down the tree.
Step 4) Go back to step 1 and repeat -> do this many times to build many trees.
How do we use it to make a prediction?
Step 5) Run the new data point down the first tree -> let's say it predicts heart disease.
Step 6) Run it down the second tree -> let's say it also says yes -> keep going through all the trees.
Step 7) The prediction is whichever option received more votes.
Bootstrapping the data and aggregating the results is called bagging.
Step 8) Typically some of the data doesn't end up in a given bootstrapped dataset -> this is the out-of-bag dataset -> we test each tree on the out-of-bag data it never saw.
Step 9) Accuracy = the proportion of out-of-bag samples that were correctly classified.
Step 10) We can then vary things like the number of variables considered per split and choose the setting which performs best.
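A hedged scikit-learn sketch of these steps; the parameter values are my own illustration, and the out-of-bag score corresponds to steps 8-9:

```python
# Sketch only: the steps above expressed with scikit-learn's RandomForestClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,         # number of bootstrapped trees (step 4)
    bootstrap=True,           # sample with replacement (step 1)
    max_features="sqrt",      # random subset of variables at each split (step 2)
    oob_score=True,           # evaluate on the out-of-bag samples (steps 8-9)
    random_state=0,
)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)
print("majority vote for one sample:", rf.predict(X[:1]))    # step 7
```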
Explain gradient boosted trees
Gradient boosting starts with an initial guess (for example, the average of the target), then builds trees of a fixed size off the errors (residuals) of the previous prediction. It is similar to AdaBoost in that it scales the trees, but unlike AdaBoost it scales all trees by the same amount (the learning rate). Each new tree is built off the residuals left by the trees before it, and trees keep being added until the requested number of trees is reached or the fit stops improving. Because we are essentially predicting the residuals, the model has low bias but can have high variance if we overfit; the learning rate fights this, since taking small steps tends to give better predictions on the test dataset. The final prediction is the initial prediction plus the scaled prediction of the first tree, plus the scaled prediction of the second tree, and so on, adding one tree's contribution at a time.
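A hedged sketch of the core idea in plain code: start from the mean, repeatedly fit a small tree to the current residuals, and add its learning-rate-scaled predictions. The tree depth, learning rate, and number of trees are arbitrary illustration values:

```python
# Sketch only: gradient boosting by hand - fit small trees to the residuals and
# add their learning-rate-scaled predictions to a running prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)

prediction = np.full_like(y, y.mean())               # initial guess
learning_rate = 0.1
trees = []
for _ in range(100):
    residuals = y - prediction                       # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)    # small step toward the residuals
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```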
Explain XGBoost trees
XGBoost is designed to be used with large, complicated datasets.
It starts out with an initial prediction.
XGBoost then fits a regression tree to the residuals, like gradient boosting, but it uses its own unique kind of regression tree.
Each tree starts out as a single leaf, and all the residuals go to that leaf. We calculate a similarity score for the leaf: the sum of the residuals, squared, divided by the number of residuals plus lambda (a regularization parameter).
Then we ask: can we do a better job clustering the residuals by splitting the leaf?
We calculate a similarity score for each of the candidate leaves produced by the split.
When the residuals in a leaf are similar to each other (or there is just one), the similarity score is large.
To compare the new leaves with the unsplit leaf, we compute the gain: the similarity score of the left leaf plus the similarity score of the right leaf minus the similarity score of the original leaf.
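A minimal numpy sketch of the similarity-score and gain arithmetic described above (regression case); the residual values, the candidate split, and lambda are made up:

```python
# Sketch only: the similarity-score and gain arithmetic for the regression case.
import numpy as np

def similarity(residuals, lam=1.0):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return residuals.sum() ** 2 / (len(residuals) + lam)

residuals = np.array([-10.5, 6.5, 7.5, -7.5])        # all residuals start in one leaf
root = similarity(residuals)

left, right = residuals[:1], residuals[1:]           # one candidate split of the residuals
gain = similarity(left) + similarity(right) - root   # left + right - original leaf
print(root, gain)      # a larger gain means the split clusters similar residuals better
```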
What hyperparameter tuning methods do you know?
Grid Search is an exhaustive approach: for each hyperparameter, the user manually gives a list of values for the algorithm to try. Grid search then evaluates the algorithm using each and every combination of hyperparameters and returns the combination that gives the optimal result (i.e., the lowest MAE). Because grid search evaluates the given algorithm using all combinations, it can be quite computationally expensive and can lead to sub-optimal results, since the user needs to specify specific values for these hyperparameters, which is prone to error and requires domain knowledge.
Random Search is similar to grid search but differs in the sense that rather than specifying which values to try for each hyper-parameter, an upper and lower bound of values for each hyper-parameter is given instead. With uniform probability, random values within these bounds are then chosen and similarly, the best combination is returned to the user. Although this seems less intuitive, no domain knowledge is necessary and theoretically much more of the parameter space can be explored.
Bayesian optimization - builds a probabilistic (surrogate) model of the validation score as a function of the hyperparameters and uses it to decide which values are most promising to evaluate next, so it typically needs fewer evaluations than grid or random search.
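A hedged sketch contrasting grid search and random search on the same model with scikit-learn; the parameter lists/ranges are arbitrary illustration values:

```python
# Sketch only: grid search (explicit value lists) vs. random search (ranges/distributions).
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [3, 5, None]},   # every combination tried
    cv=3,
).fit(X, y)

rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(2, 10)},
    n_iter=10,                # only 10 random combinations sampled from the ranges
    cv=3,
    random_state=0,
).fit(X, y)

print(grid.best_params_, rand.best_params_)
```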
What are the problems with sigmoid as an activation function?
The derivative of the sigmoid function is almost zero for large positive or negative inputs. From this comes the problem of the vanishing gradient: during backpropagation our net will not learn (or will learn extremely slowly). One possible way to solve this problem is to use the ReLU activation function.
What is ReLU? How is it better than sigmoid?
ReLU is an abbreviation for Rectified Linear Unit. It is an activation function which has the value 0 for all negative values and the value f(x) = x for all positive values. The ReLU has a simple activation function which makes it fast to compute and while the sigmoid and tanh activation functions saturate at higher values, the ReLU has a potentially infinite activation, which addresses the problem of vanishing gradients.
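A minimal numpy sketch of why this matters for gradients: the sigmoid's derivative collapses toward zero for large |x|, while ReLU's derivative is 1 for every positive input (the sample inputs are made up):

```python
# Sketch only: the sigmoid's gradient vanishes for large |x|, ReLU's does not.
import numpy as np

x = np.array([-10.0, -1.0, 0.5, 1.0, 10.0])

sig = 1.0 / (1.0 + np.exp(-x))
sig_grad = sig * (1.0 - sig)              # ~4.5e-5 at x = 10 -> vanishing gradient
relu_grad = (x > 0).astype(float)         # 1 for every positive input, 0 otherwise

print(sig_grad)
print(relu_grad)
```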
What regularization techniques do you know for neural networks?
L1 Regularization - Defined as the sum of absolute values of the individual parameters. The L1 penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.
L2 Regularization - Defined as the sum of square of individual parameters. Often supported by regularization hyperparameter alpha. It results in weight decay.
Data Augmentation - creates additional, synthetically generated or transformed examples as part of the training set.
Dropout - one of the most effective regularization techniques for neural nets. A few random nodes in each layer are deactivated in the forward pass, so the algorithm trains on a different set of nodes in each iteration (see the sketch below).
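A hedged numpy sketch of (inverted) dropout applied by hand during a forward pass; the layer size and dropout rate are arbitrary, and real frameworks apply this only at training time:

```python
# Sketch only: inverted dropout applied by hand during a forward pass.
import numpy as np

rng = np.random.default_rng(9)
activations = rng.normal(size=(4, 8))     # one mini-batch of hidden activations
keep_prob = 0.8                           # i.e. a dropout rate of 0.2

mask = rng.random(activations.shape) < keep_prob
dropped = activations * mask / keep_prob  # random nodes zeroed, survivors rescaled

print(mask.astype(int))                   # a different random mask each forward pass
print(dropped)
```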
What is the difference between Data Mining and Data Profiling?
Data mining is the process of finding relevant information that has not been found before - the way in which raw data is turned into valuable information. It can involve anything from web scraping to census data.
Data profiling is usually done to assess a dataset for its uniqueness, consistency, and logic - looking at the data and asking "is it related to what I'm working with?"
Define data wrangling in terms of data analytics
Data wrangling is the process of cleaning, structuring, and enriching raw data into a desirable, usable format for better decision making.
What are the various steps involved in any analytics project?
Understand the problem, collect the data, clean the data, explore and analyze the data, and interpret the results.
What are the best practices for data cleaning?
In most analyses, around 80% of the work is in the cleaning.
Make a data cleaning plan by understanding where common errors take place, and keep communication open.
Identify and remove duplicates before working with the data
Focus on the accuracy of the data and maintain the correct data types.
Standardize the data at point of entry
How do you subset or filter data in SQL?
With the WHERE and HAVING clauses: WHERE filters individual rows before any grouping, while HAVING filters groups after a GROUP BY aggregation.