SQL Essentials Flashcards
How do you create a table in SQL?
Use the CREATE TABLE command. After the words CREATE TABLE you provide the new table's name, and then either declare the table structure or use the AS keyword to create the table from the result of a SELECT query.
You can also put the TEMPORARY keyword before TABLE to create a temporary table.
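A minimal sketch of all three forms (table and column names are made up for illustration):

```sql
-- Declare the structure explicitly
CREATE TABLE person (
    id   serial PRIMARY KEY,
    name text NOT NULL,
    age  integer
);

-- Create a table from the result of a SELECT
CREATE TABLE adult AS
SELECT id, name FROM person WHERE age >= 18;

-- Create a temporary table (dropped when the session ends)
CREATE TEMPORARY TABLE staging_person (LIKE person);
```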
How do you delete a table in SQL?
With the DROP TABLE command, followed by the table name.
How do you change the structure of an SQL table?
A table's structure can be changed with the ALTER TABLE command.
With this command you can add, drop, and alter columns, foreign keys, and constraints.
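For example, with a hypothetical person table, the two commands look like this:

```sql
-- Add, modify, and remove columns
ALTER TABLE person ADD COLUMN email text;
ALTER TABLE person ALTER COLUMN email SET NOT NULL;
ALTER TABLE person DROP COLUMN email;

-- Add and drop a constraint
ALTER TABLE person ADD CONSTRAINT age_positive CHECK (age >= 0);
ALTER TABLE person DROP CONSTRAINT age_positive;

-- Delete the whole table
DROP TABLE person;
```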
What is the difference between regular tables and temporary tables in SQL?
Regular tables are permanent, while temporary tables exist only within a particular database session and are deleted when the session ends.
What are constraints in SQL?
Constraints in SQL are rules applied to particular columns to prevent unwanted data from getting in. Examples are CHECK, UNIQUE, DEFAULT, and FOREIGN KEY constraints. Each of them enforces a specific behaviour in specific conditions. For example, a CHECK constraint defines a condition a value must satisfy before it can be inserted into the table, while DEFAULT supplies a value that is inserted into a column when no value was provided.
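A small illustration of several constraints in one table definition (all names are hypothetical):

```sql
CREATE TABLE account (
    id       serial PRIMARY KEY,
    email    text NOT NULL UNIQUE,          -- no duplicates, must be present
    balance  numeric DEFAULT 0,             -- used when no value is supplied
    age      integer CHECK (age >= 18),     -- insert rejected if condition fails
    owner_id integer REFERENCES person (id) -- foreign key to another table
);
```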
What is a primary key?
A primary key is a unique identifier (a column or set of columns) used to identify a particular row in a table.
What is a foreign key?
A foreign key is a value that references a particular row in another table. It can be used to logically connect tables.
What is the difference between local temporary and global temporary SQL tables?
Local temporary tables exist for the duration of a single session and are accessible only within that session.
Global temporary tables exist as long as any session uses them and are accessible from all sessions.
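The local/global distinction is most visible in SQL Server's T-SQL dialect, where the name prefix decides the scope (PostgreSQL temporary tables are always session-local):

```sql
-- SQL Server (T-SQL) syntax: one # = local, ## = global
CREATE TABLE #local_temp   (id int); -- visible only in this session
CREATE TABLE ##global_temp (id int); -- visible to all sessions while in use
```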
What is table normalization?
Table normalization is the process of splitting a table into multiple related tables while keeping the same logical content. It can be necessary in order to simplify or optimize the table structure.
For example, imagine a table named Product with a few columns: id, name, and category, where name and category are plain text fields. The problem in this case is, first, that the category has to be written out in every row, wasting space. Second, categories that should mean the same thing can end up mismatched: some may be written only in lower case, without a capital letter, and so on, which makes the data hard to process.
To solve this problem, we can move the categories into a separate table, and instead of storing the category name in each product row, store only a category id. This simplifies working with the data.
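The normalized version of this example might look like this (hypothetical names):

```sql
CREATE TABLE category (
    id   serial PRIMARY KEY,
    name text NOT NULL UNIQUE   -- each category name stored exactly once
);

CREATE TABLE product (
    id          serial PRIMARY KEY,
    name        text NOT NULL,
    category_id integer NOT NULL REFERENCES category (id)
);
```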
What are JOINs in SQL?
JOIN is an SQL mechanism that combines two tables inside a query, so the rest of the query can then work with the resulting combined rows.
How does INNER JOIN work?
With an inner join, the resulting table keeps only those rows from the original tables for which the join condition finds matching values in both tables.
How does LEFT JOIN work?
Imagine two tables, A and B.
With a LEFT JOIN of table B onto table A, the result contains all rows from table A, but data from table B appears only in those rows where the join condition matches.
How does RIGHT JOIN work?
Imagine two tables, A and B.
With a RIGHT JOIN of table B onto table A, the result contains all rows from table B, but data from table A appears only in those rows where the join condition matches.
How does FULL JOIN work?
Imagine two tables, A and B.
With a FULL JOIN of tables A and B, the result contains all rows from both tables, but a row carries complete data only where the join condition matches; in rows without a match, the part belonging to the other table is empty (NULL).
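All four join types can be compared on one pair of toy tables (names invented for illustration):

```sql
-- Sample data: one order references a customer that does not exist,
-- and one customer has no orders
CREATE TABLE customer (id int, name text);
CREATE TABLE orders   (id int, customer_id int, total numeric);

INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders   VALUES (10, 1, 99.50), (11, 3, 12.00);

SELECT c.name, o.total
FROM customer c INNER JOIN orders o ON o.customer_id = c.id;
-- Alice | 99.50                            (only the matching pair)

SELECT c.name, o.total
FROM customer c LEFT JOIN orders o ON o.customer_id = c.id;
-- Alice | 99.50, Bob | NULL                (all customers kept)

SELECT c.name, o.total
FROM customer c RIGHT JOIN orders o ON o.customer_id = c.id;
-- Alice | 99.50, NULL | 12.00              (all orders kept)

SELECT c.name, o.total
FROM customer c FULL JOIN orders o ON o.customer_id = c.id;
-- Alice | 99.50, Bob | NULL, NULL | 12.00  (everything kept)
```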
How does SELF JOIN work?
SELF JOIN is the name for joining a table with itself. It can be used when a table references itself.
For example, imagine a table of employees where each row stores an employee's id, name, and the id of that employee's boss. If we want to get each employee's id and name together with the boss's name, we can use a SELF JOIN.
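The employee example as a query (table and column names are assumptions):

```sql
CREATE TABLE employee (
    id      serial PRIMARY KEY,
    name    text NOT NULL,
    boss_id integer REFERENCES employee (id)  -- the table references itself
);

-- Join the table with itself under two different aliases
SELECT e.id, e.name, b.name AS boss_name
FROM employee e
LEFT JOIN employee b ON b.id = e.boss_id;  -- LEFT keeps employees with no boss
```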
How does CROSS JOIN work and where can it be used?
CROSS JOIN is a specific kind of join that pairs every row of the first table with every row of the second table (the Cartesian product).
As for where it can be used:
As the simplest example, imagine we sell T-shirts and have tables with the T-shirt sizes and, say, the materials they can be made of. If we want a table storing every possible combination of size and material, CROSS JOIN does exactly that.
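The T-shirt example with hypothetical tables:

```sql
CREATE TABLE size     (label text);
CREATE TABLE material (name  text);

INSERT INTO size     VALUES ('S'), ('M'), ('L');
INSERT INTO material VALUES ('cotton'), ('polyester');

-- 3 sizes x 2 materials = 6 combinations
SELECT s.label, m.name
FROM size s CROSS JOIN material m;
```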
What is an SQL index and how does it work?
An index in SQL is a special data structure that reduces query time when you query particular columns of a table.
By default there are no indexes on columns, except for the primary key column, so if you want to select only certain rows, with a query like "select name from person where age equals 20", the database has to go through all rows and compare each of them against your condition.
That is fine for a small table, but what if it contains something like a million, or ten million, rows? In that case the query will be quite slow, because a lot of work is needed to compare all those values and find only the matches.
Indexes are structures the database can use to optimize this process. An index is usually a pre-computed balanced tree (B-tree) or hash built over the values of particular columns, and it reduces the number of steps needed to find the required rows.
But when optimizing with indexes, always remember that they are not something magical that boosts performance with no side effects; they have a direct impact on other aspects of your database.
First, indexes are blocks of additional data about your table, and they have to be stored somewhere, so with indexes you need more disk space than without them.
Also, an index covers every row of its table, so every time you insert new rows, or delete or update old ones, the index has to be updated too. This makes those operations slower, and if a table receives many writes but a much smaller number of reads, the best optimization can be to drop an index.
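Creating and inspecting an index for the query above (PostgreSQL syntax; names are made up):

```sql
-- Build a B-tree index over the age column
CREATE INDEX idx_person_age ON person (age);

-- EXPLAIN shows whether the planner uses the index
-- (expect an Index Scan instead of a Seq Scan on a large table)
EXPLAIN SELECT name FROM person WHERE age = 20;

-- If the table is write-heavy and this index is rarely used, drop it
DROP INDEX idx_person_age;
```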
How do multicolumn indexes work in SQL?
Multicolumn indexes are optimized for queries that filter rows first by the first index column, then by the second, and so on.
It is important to remember that for the best performance the query should filter by all the columns that make up the index.
For example, if a table has columns A, B, and C, and we create a multicolumn index on (A, B, C), we get the best performance from queries that filter by all three columns.
Queries that filter by only one or two of them can perform worse.
For example, if we filter only by A, performance will be almost the same as with a single-column index on A.
If we filter only by B, or by B and C, performance will be significantly lower, because the database has to read more of the index before it can filter.
The worst case is filtering only by column C; even so, this can still be better than having no index at all.
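A sketch of the column-order effect on a hypothetical table:

```sql
CREATE TABLE t (a int, b int, c int);
CREATE INDEX idx_t_abc ON t (a, b, c);

-- Best case: all index columns are constrained
SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;

-- Still good: a leading prefix of the index (a) is constrained
SELECT * FROM t WHERE a = 1;

-- Worse: the leading column a is not constrained,
-- so much more of the index has to be read before filtering
SELECT * FROM t WHERE b = 2 AND c = 3;
```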
What is selectivity in SQL?
Selectivity in SQL is the ratio of the number of distinct values in a column to the total number of rows in the table.
Selectivity ranges from zero to one, where a value near zero means the rows contain very few distinct values, and a selectivity of one means every value in the column is unique.
The higher the selectivity of an indexed column, the better, because the database works more effectively when using that index. Conversely, if selectivity is low, say near zero or around 0.5, it can be better to avoid the index entirely when running the query, because fetching the required information through the index would take longer than simply scanning all rows directly.
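Selectivity can be estimated directly with a query (PostgreSQL cast syntax; table and column are assumed):

```sql
-- Distinct values divided by total rows, as a fraction between 0 and 1
SELECT count(DISTINCT age)::numeric / count(*) AS selectivity
FROM person;
-- close to 1.0 -> good index candidate (almost every value unique)
-- close to 0.0 -> poor candidate (e.g. a boolean-like column)
```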
What is a clustered index in SQL?
A clustered index in SQL is a specific approach to storing data: the rows themselves are stored inside the index structure, instead of the index holding only identifiers that reference rows stored elsewhere.
A table can have only one clustered index, because the same row cannot be physically stored in two different places by two different clustered indexes.
What is a covering index in SQL?
A covering index is an index that already contains all the information required to answer a particular query, so there is no need to access the table itself at all.
Covering indexes are usually practical only for very simple queries, or in cases where an index was created specifically to optimize one particular query.
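In PostgreSQL (version 11 and later), extra columns can be stored in the index with the INCLUDE clause to make it covering for a given query:

```sql
-- Store name alongside age in the index leaf pages
CREATE INDEX idx_person_age_name ON person (age) INCLUDE (name);

-- This query can now be answered from the index alone
-- (an "Index Only Scan" in EXPLAIN), without touching the table
SELECT name FROM person WHERE age = 20;
```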
How do you optimize an SQL query?
First of all, we can use indexes.
If we run a lot of select queries, and even better a lot of similar select queries, we can create an index or indexes on the columns they use, significantly reducing the time required to filter rows.
The second thing we can do is analyze the queries themselves. Are they optimal? Are we doing something like recreating a non-materialized table on every iteration of a subquery?
Do we select too many columns inside subqueries and never use them later? Are our joins inefficient, for instance a LEFT JOIN where an INNER JOIN would do?
Third, we should check whether we need all the indexes the table has. If there are too many rarely used indexes, it can be better to get rid of some, because they make the database slower when inserting, updating, or deleting rows.
Finally, maybe the queries do expensive calculations that could be avoided. For example, to select rows created during a certain year, instead of comparing each row's year with the target year using a YEAR-extraction function, we can compare the date directly against a range with BETWEEN or explicit bounds, avoiding the per-row calculation.
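The last point as a concrete rewrite (PostgreSQL syntax; the orders table and created_at column are assumptions):

```sql
-- Slow: the function hides created_at from any index on it
SELECT * FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2023;

-- Fast: a plain range comparison, which an index on created_at can serve
SELECT * FROM orders
WHERE created_at >= DATE '2023-01-01'
  AND created_at <  DATE '2024-01-01';
```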
What value types exist in SQL? (PostgreSQL)
- Integers: smallint, integer, bigint, and the auto-incrementing serial/bigserial
- Floating-point numbers: real and double precision
- Fixed-length strings: char(n)
- Variable-length strings: varchar(n) and text
- Dates and timestamps
- UUID
- Boolean
And a number of other, more specialized types.
What are the differences between PostgreSQL and other databases?
I would say that PostgreSQL is a heavier, but more universal and reliable solution for large systems.
First of all, extensibility: in PostgreSQL you can add your own data types, and besides custom types you can also add your own procedures and functions.
It also supports a very large number of data types out of the box. These are not only the default strings, numbers, and so on, but also JSON, XML, arrays, and geometric data.
In addition, PostgreSQL supports considerably more index types than, say, MySQL.
Also, as far as I remember, PostgreSQL handles parallel queries better and generally optimizes queries better, though I don't recall the details right now.
Finally, PostgreSQL is quite scalable and supports replication and sharding.
What is replication?
Replication is one of the mechanisms for scaling databases.
With replication, we copy our database, or part of it, to another place, and then use the copy for certain purposes.
So, what are the different replication scenarios?
The first is so-called Master-Slave, or Primary-Secondary, replication. In this case we have one primary database serving insert, update, and delete operations, and a few secondary databases used only for reading data. This is useful if we have to handle a huge number of read requests, because those requests can be distributed between several nodes.
The second variant is Master-Master, or Primary-Primary, replication. Here, every node in the database network can handle both read and write requests. However, the conflicts and mistakes that can be caused by outdated data on individual nodes make this approach less effective, so it is usually used only when we need a spare database node in reserve, ready to take over if the main node fails for any reason.
Now, what can those purposes be?
First of all, increased data availability. Even if one or several nodes fail, the system will still be able to serve requests, so reliability increases.
The second reason is database load reduction. If there are too many requests, it can be difficult to handle all of them on a single server. With replication, the load can be distributed among multiple servers.
Next is geodistribution. Some of our users may be quite far away from the main node. In that case, we can use additional nodes to store a copy of the data nearer to them, leading to faster response times.
Finally, replication can protect against data loss. Imagine a fire in our datacenter: with a single node, all our data is probably lost, but with several nodes the others still hold a backup of the data.
Also, replication can be synchronous or asynchronous.
With synchronous replication, when the primary database receives a write request, it marks the transaction complete only after all the secondary databases have been updated as well.
With asynchronous replication, we don't wait for the secondary nodes to update. This makes write operations faster, but can lead to temporary data inconsistency.
What is sharding?
Sharding is a horizontal database scaling technique based on splitting the data.
The idea of sharding is to divide the data in our database into smaller pieces, or shards, by some criterion, and distribute them between multiple servers in order to reduce the load on any single one.
Sharding can be useful for optimization, because some queries can then run in parallel on each shard and finish faster. But it is also important to remember that queries written without taking the sharding into account can become significantly slower if they involve a large number of shards.
Generally speaking, the fewer shards we need to touch to serve a request, the better. The ideal situation is when we need to access only one shard to get the required information.
As a simple example, imagine a news portal storing its articles in different shards. Depending on the criterion used to split the data, performance can be totally different.
If we unwisely split the rows by article length, we will probably have serious performance problems, because every time we want to show a visitor the latest news, we have to go over absolutely all the shards: we have no idea which shards those rows are placed in. In the opposite situation, if our sharding criterion is the article creation date, the system will be quite fast, because we only need to access the one or few shards where the latest rows are placed.
So it is very important to understand what criterion to use for sharding. For news it can be the creation date; for other data, like user orders on a marketplace, it can be the user identifier.
But if the model is complex, there may be more than one sharding key.
For example, on a marketplace we might need to shard our information by user id, which is important for customers, and also by order id, which is important for logistics. In that case we inevitably have to duplicate at least part of our data and spend more disk space, but this improves performance.
Well, now a few words about data distribution between shards. There are two ways to do this: fixed distribution and dynamic distribution.
In the first case we use a splitting function that relies only on the sharding key, like the article creation date or the user id.
For example, it can be simple modulo division or a hash function. This is a fast and simple way to distribute data, but it causes problems if some shards turn out to be more popular than others, because we cannot rebalance the load between them. Also, if we later want to change the number of shards, or just change the distribution algorithm, that will be a difficult task, because a lot of data will have to be re-distributed among the new shards.
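A sketch of fixed, hash-based shard assignment (PostgreSQL's hashtext function is used here purely for illustration; any stable hash would do, and the table is hypothetical):

```sql
-- Map each user id to one of 4 shards by hashing it
SELECT user_id,
       abs(hashtext(user_id::text)) % 4 AS shard_no
FROM orders;

-- Changing the shard count from 4 to 5 reassigns most keys,
-- which is exactly why resharding under fixed distribution is painful
```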
The other approach is dynamic distribution. Here, the exact location of each row is stored in a separate lookup table, so every time we want a particular row, we first ask that table where it is located.
This, of course, generally reduces performance, but it makes it possible to dynamically move items between shards, balancing the load on each of them. It also lets us distribute the load between shards unevenly, which can be useful if our servers differ in capacity.