SQL Essentials Flashcards
How do you create a table in SQL?
Use the CREATE TABLE command. After the words CREATE TABLE you provide the new table's name, and then either declare the table structure or use the AS keyword to create the table from the result of a SELECT query.
You can also put the TEMPORARY keyword before TABLE to create a temporary table.
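A minimal sketch of all three forms (table and column names are made up for illustration):

```sql
-- Declare the structure explicitly
CREATE TABLE person (
    id   serial PRIMARY KEY,
    name text NOT NULL,
    age  integer
);

-- Create a table from the result of a SELECT
CREATE TABLE adult AS
SELECT id, name FROM person WHERE age >= 18;

-- Create a temporary table (dropped when the session ends)
CREATE TEMPORARY TABLE staging_person (LIKE person);
```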
How do you delete a table in SQL?
With the DROP TABLE command, followed by the table name.
How do you change the structure of an SQL table?
A table's structure can be changed with the ALTER TABLE command.
With this command you can add, drop, and alter columns, foreign keys, and constraints.
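For example, with a hypothetical person table, the two commands look like this:

```sql
-- Add, modify, and remove columns
ALTER TABLE person ADD COLUMN email text;
ALTER TABLE person ALTER COLUMN email SET NOT NULL;
ALTER TABLE person DROP COLUMN email;

-- Add and drop a constraint
ALTER TABLE person ADD CONSTRAINT age_positive CHECK (age >= 0);
ALTER TABLE person DROP CONSTRAINT age_positive;

-- Delete the whole table
DROP TABLE person;
```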
What is the difference between regular tables and temporary tables in SQL?
Regular tables are permanent, while temporary tables exist only within a particular database session and are deleted when the session ends.
What are constraints in SQL?
Constraints in SQL are rules applied to particular columns to prevent unwanted data from getting in. Examples are CHECK, UNIQUE, DEFAULT, and FOREIGN KEY constraints. Each of them enforces a specific behaviour in specific conditions. For example, a CHECK constraint defines a condition a value must satisfy before it can be inserted into the table, while DEFAULT supplies a value that is inserted into a column when no value was provided.
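A small illustration of several constraints in one table definition (all names are hypothetical):

```sql
CREATE TABLE account (
    id       serial PRIMARY KEY,
    email    text NOT NULL UNIQUE,          -- no duplicates, must be present
    balance  numeric DEFAULT 0,             -- used when no value is supplied
    age      integer CHECK (age >= 18),     -- insert rejected if condition fails
    owner_id integer REFERENCES person (id) -- foreign key to another table
);
```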
What is a primary key?
A primary key is a unique identifier (a column or set of columns) used to identify a particular row in a table.
What is a foreign key?
A foreign key is a value that references a particular row in another table. It can be used to logically connect tables.
What is the difference between local temporary and global temporary SQL tables?
Local temporary tables exist for the duration of a single session and are accessible only within that session.
Global temporary tables exist as long as any session uses them and are accessible from all sessions.
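The local/global distinction is most visible in SQL Server's T-SQL dialect, where the name prefix decides the scope (PostgreSQL temporary tables are always session-local):

```sql
-- SQL Server (T-SQL) syntax: one # = local, ## = global
CREATE TABLE #local_temp   (id int); -- visible only in this session
CREATE TABLE ##global_temp (id int); -- visible to all sessions while in use
```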
What is table normalization?
Table normalization is the process of splitting a table into multiple related tables while keeping the same logical content. It can be necessary in order to simplify or optimize the table structure.
For example, imagine a table named Product with a few columns: id, name, and category, where name and category are plain text fields. The problem in this case is, first, that the category has to be written out in every row, wasting space. Second, categories that should mean the same thing can end up mismatched: some may be written only in lower case, without a capital letter, and so on, which makes the data hard to process.
To solve this problem, we can move the categories into a separate table, and instead of storing the category name in each product row, store only a category id. This simplifies working with the data.
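The normalized version of this example might look like this (hypothetical names):

```sql
CREATE TABLE category (
    id   serial PRIMARY KEY,
    name text NOT NULL UNIQUE   -- each category name stored exactly once
);

CREATE TABLE product (
    id          serial PRIMARY KEY,
    name        text NOT NULL,
    category_id integer NOT NULL REFERENCES category (id)
);
```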
What are JOINs in SQL?
JOIN is an SQL mechanism that combines two tables inside a query, so the rest of the query can then work with the resulting combined rows.
How does INNER JOIN work?
With an inner join, the resulting table keeps only those rows from the original tables for which the join condition finds matching values in both tables.
How does LEFT JOIN work?
Imagine two tables, A and B.
With a LEFT JOIN of table B onto table A, the result contains all rows from table A, but data from table B appears only in those rows where the join condition matches.
How does RIGHT JOIN work?
Imagine two tables, A and B.
With a RIGHT JOIN of table B onto table A, the result contains all rows from table B, but data from table A appears only in those rows where the join condition matches.
How does FULL JOIN work?
Imagine two tables, A and B.
With a FULL JOIN of tables A and B, the result contains all rows from both tables, but a row carries complete data only where the join condition matches; in rows without a match, the part belonging to the other table is empty (NULL).
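All four join types can be compared on one pair of toy tables (names invented for illustration):

```sql
-- Sample data: one order references a customer that does not exist,
-- and one customer has no orders
CREATE TABLE customer (id int, name text);
CREATE TABLE orders   (id int, customer_id int, total numeric);

INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders   VALUES (10, 1, 99.50), (11, 3, 12.00);

SELECT c.name, o.total
FROM customer c INNER JOIN orders o ON o.customer_id = c.id;
-- Alice | 99.50                            (only the matching pair)

SELECT c.name, o.total
FROM customer c LEFT JOIN orders o ON o.customer_id = c.id;
-- Alice | 99.50, Bob | NULL                (all customers kept)

SELECT c.name, o.total
FROM customer c RIGHT JOIN orders o ON o.customer_id = c.id;
-- Alice | 99.50, NULL | 12.00              (all orders kept)

SELECT c.name, o.total
FROM customer c FULL JOIN orders o ON o.customer_id = c.id;
-- Alice | 99.50, Bob | NULL, NULL | 12.00  (everything kept)
```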
How does SELF JOIN work?
SELF JOIN is the name for joining a table with itself. It can be used when a table references itself.
For example, imagine a table of employees where each row stores an employee's id, name, and the id of that employee's boss. If we want to get each employee's id and name together with the boss's name, we can use a SELF JOIN.
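The employee example as a query (table and column names are assumptions):

```sql
CREATE TABLE employee (
    id      serial PRIMARY KEY,
    name    text NOT NULL,
    boss_id integer REFERENCES employee (id)  -- the table references itself
);

-- Join the table with itself under two different aliases
SELECT e.id, e.name, b.name AS boss_name
FROM employee e
LEFT JOIN employee b ON b.id = e.boss_id;  -- LEFT keeps employees with no boss
```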
How does CROSS JOIN work and where can it be used?
CROSS JOIN is a specific kind of join that pairs every row of the first table with every row of the second table (the Cartesian product).
As for where it can be used:
As the simplest example, imagine we sell T-shirts and have tables with the T-shirt sizes and, say, the materials they can be made of. If we want a table storing every possible combination of size and material, CROSS JOIN does exactly that.
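The T-shirt example with hypothetical tables:

```sql
CREATE TABLE size     (label text);
CREATE TABLE material (name  text);

INSERT INTO size     VALUES ('S'), ('M'), ('L');
INSERT INTO material VALUES ('cotton'), ('polyester');

-- 3 sizes x 2 materials = 6 combinations
SELECT s.label, m.name
FROM size s CROSS JOIN material m;
```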
What is an SQL index and how does it work?
An index in SQL is a special data structure that reduces query time when you query particular columns of a table.
By default there are no indexes on columns, except for the primary key column, so if you want to select only certain rows, with a query like "select name from person where age equals 20", the database has to go through all rows and compare each of them against your condition.
That is fine for a small table, but what if it contains something like a million, or ten million, rows? In that case the query will be quite slow, because a lot of work is needed to compare all those values and find only the matches.
Indexes are structures the database can use to optimize this process. An index is usually a pre-computed balanced tree (B-tree) or hash built over the values of particular columns, and it reduces the number of steps needed to find the required rows.
But when optimizing with indexes, always remember that they are not something magical that boosts performance with no side effects; they have a direct impact on other aspects of your database.
First, indexes are blocks of additional data about your table, and they have to be stored somewhere, so with indexes you need more disk space than without them.
Also, an index covers every row of its table, so every time you insert new rows, or delete or update old ones, the index has to be updated too. This makes those operations slower, and if a table receives many writes but a much smaller number of reads, the best optimization can be to drop an index.
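Creating and inspecting an index for the query above (PostgreSQL syntax; names are made up):

```sql
-- Build a B-tree index over the age column
CREATE INDEX idx_person_age ON person (age);

-- EXPLAIN shows whether the planner uses the index
-- (expect an Index Scan instead of a Seq Scan on a large table)
EXPLAIN SELECT name FROM person WHERE age = 20;

-- If the table is write-heavy and this index is rarely used, drop it
DROP INDEX idx_person_age;
```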
How do multicolumn indexes work in SQL?
Multicolumn indexes are optimized for queries that filter rows first by the first index column, then by the second, and so on.
It is important to remember that for the best performance the query should filter by all the columns that make up the index.
For example, if a table has columns A, B, and C, and we create a multicolumn index on (A, B, C), we get the best performance from queries that filter by all three columns.
Queries that filter by only one or two of them can perform worse.
For example, if we filter only by A, performance will be almost the same as with a single-column index on A.
If we filter only by B, or by B and C, performance will be significantly lower, because the database has to read more of the index before it can filter.
The worst case is filtering only by column C; even so, this can still be better than having no index at all.
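A sketch of the column-order effect on a hypothetical table:

```sql
CREATE TABLE t (a int, b int, c int);
CREATE INDEX idx_t_abc ON t (a, b, c);

-- Best case: all index columns are constrained
SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;

-- Still good: a leading prefix of the index (a) is constrained
SELECT * FROM t WHERE a = 1;

-- Worse: the leading column a is not constrained,
-- so much more of the index has to be read before filtering
SELECT * FROM t WHERE b = 2 AND c = 3;
```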
What is selectivity in SQL?
Selectivity in SQL is the ratio of the number of distinct values in a column to the total number of rows in the table.
Selectivity ranges from zero to one, where a value near zero means the rows contain very few distinct values, and a selectivity of one means every value in the column is unique.
The higher the selectivity of an indexed column, the better, because the database works more effectively when using that index. Conversely, if selectivity is low, say near zero or around 0.5, it can be better to avoid the index entirely when running the query, because fetching the required information through the index would take longer than simply scanning all rows directly.
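Selectivity can be estimated directly with a query (PostgreSQL cast syntax; table and column are assumed):

```sql
-- Distinct values divided by total rows, as a fraction between 0 and 1
SELECT count(DISTINCT age)::numeric / count(*) AS selectivity
FROM person;
-- close to 1.0 -> good index candidate (almost every value unique)
-- close to 0.0 -> poor candidate (e.g. a boolean-like column)
```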
What is a clustered index in SQL?
A clustered index in SQL is a specific approach to storing data: the rows themselves are stored inside the index structure, instead of the index holding only identifiers that reference rows stored elsewhere.
A table can have only one clustered index, because the same row cannot be physically stored in two different places by two different clustered indexes.
What is a covering index in SQL?
A covering index is an index that already contains all the information required to answer a particular query, so there is no need to access the table itself at all.
Covering indexes are usually practical only for very simple queries, or in cases where an index was created specifically to optimize one particular query.
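In PostgreSQL (version 11 and later), extra columns can be stored in the index with the INCLUDE clause to make it covering for a given query:

```sql
-- Store name alongside age in the index leaf pages
CREATE INDEX idx_person_age_name ON person (age) INCLUDE (name);

-- This query can now be answered from the index alone
-- (an "Index Only Scan" in EXPLAIN), without touching the table
SELECT name FROM person WHERE age = 20;
```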
How do you optimize an SQL query?
First of all, we can use indexes.
If we run a lot of select queries, and even better a lot of similar select queries, we can create an index or indexes on the columns they use, significantly reducing the time required to filter rows.
The second thing we can do is analyze the queries themselves. Are they optimal? Are we doing something like recreating a non-materialized table on every iteration of a subquery?
Do we select too many columns inside subqueries and never use them later? Are our joins inefficient, for instance a LEFT JOIN where an INNER JOIN would do?
Third, we should check whether we need all the indexes the table has. If there are too many rarely used indexes, it can be better to get rid of some, because they make the database slower when inserting, updating, or deleting rows.
Finally, maybe the queries do expensive calculations that could be avoided. For example, to select rows created during a certain year, instead of comparing each row's year with the target year using a YEAR-extraction function, we can compare the date directly against a range with BETWEEN or explicit bounds, avoiding the per-row calculation.
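The last point as a concrete rewrite (PostgreSQL syntax; the orders table and created_at column are assumptions):

```sql
-- Slow: the function hides created_at from any index on it
SELECT * FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2023;

-- Fast: a plain range comparison, which an index on created_at can serve
SELECT * FROM orders
WHERE created_at >= DATE '2023-01-01'
  AND created_at <  DATE '2024-01-01';
```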
What value types exist in SQL? (PostgreSQL)
- Integers: smallint, integer, bigint, and the auto-incrementing serial/bigserial
- Floating-point numbers: real and double precision
- Fixed-length strings: char(n)
- Variable-length strings: varchar(n) and text
- Dates and timestamps
- UUID
- Boolean
And a number of other, more specialized types.
What are the differences between PostgreSQL and other databases?
I would say that PostgreSQL is a heavier, but more universal and reliable solution for large systems.
First of all, extensibility: in PostgreSQL you can add your own data types, and besides custom types you can also add your own procedures and functions.
It also supports a very large number of data types out of the box. These are not only the default strings, numbers, and so on, but also JSON, XML, arrays, and geometric data.
In addition, PostgreSQL supports considerably more index types than, say, MySQL.
Also, as far as I remember, PostgreSQL handles parallel queries better and generally optimizes queries better, though I don't recall the details right now.
Finally, PostgreSQL is quite scalable and supports replication and sharding.
What is replication?
Replication is one of the mechanisms for scaling databases.
With replication, we copy our database, or part of it, to another place, and then use the copy for certain purposes.
So, what are the different replication scenarios?
The first is so-called Master-Slave, or Primary-Secondary, replication. In this case we have one primary database serving insert, update, and delete operations, and a few secondary databases used only for reading data. This is useful if we have to handle a huge number of read requests, because those requests can be distributed between several nodes.
The second variant is Master-Master, or Primary-Primary, replication. Here, every node in the database network can handle both read and write requests. However, the conflicts and mistakes that can be caused by outdated data on individual nodes make this approach less effective, so it is usually used only when we need a spare database node in reserve, ready to take over if the main node fails for any reason.
Now, what can those purposes be?
First of all, increased data availability. Even if one or several nodes fail, the system will still be able to serve requests, so reliability increases.
The second reason is database load reduction. If there are too many requests, it can be difficult to handle all of them on a single server. With replication, the load can be distributed among multiple servers.
Next is geodistribution. Some of our users may be quite far away from the main node. In that case, we can use additional nodes to store a copy of the data nearer to them, leading to faster response times.
Finally, replication can protect against data loss. Imagine a fire in our datacenter: with a single node, all our data is probably lost, but with several nodes the others still hold a backup of the data.
Also, replication can be synchronous or asynchronous.
With synchronous replication, when the primary database receives a write request, it marks the transaction complete only after all the secondary databases have been updated as well.
With asynchronous replication, we don't wait for the secondary nodes to update. This makes write operations faster, but can lead to temporary data inconsistency.
What is sharding?
Sharding is a horizontal database scaling technique based on splitting the data.
The idea of sharding is to divide the data in our database into smaller pieces, or shards, by some criterion, and distribute them between multiple servers in order to reduce the load on any single one.
Sharding can be useful for optimization, because some queries can then run in parallel on each shard and finish faster. But it is also important to remember that queries written without taking the sharding into account can become significantly slower if they involve a large number of shards.
Generally speaking, the fewer shards we need to touch to serve a request, the better. The ideal situation is when we need to access only one shard to get the required information.
As a simple example, imagine a news portal storing its articles in different shards. Depending on the criterion used to split the data, performance can be totally different.
If we unwisely split the rows by article length, we will probably have serious performance problems, because every time we want to show a visitor the latest news, we have to go over absolutely all the shards: we have no idea which shards those rows are placed in. In the opposite situation, if our sharding criterion is the article creation date, the system will be quite fast, because we only need to access the one or few shards where the latest rows are placed.
So it is very important to understand what criterion to use for sharding. For news it can be the creation date; for other data, like user orders on a marketplace, it can be the user identifier.
But if the model is complex, there may be more than one sharding key.
For example, on a marketplace we might need to shard our information by user id, which is important for customers, and also by order id, which is important for logistics. In that case we inevitably have to duplicate at least part of our data and spend more disk space, but this improves performance.
Well, now a few words about data distribution between shards. There are two ways to do this: fixed distribution and dynamic distribution.
In the first case we use a splitting function that relies only on the sharding key, like the article creation date or the user id.
For example, it can be simple modulo division or a hash function. This is a fast and simple way to distribute data, but it causes problems if some shards turn out to be more popular than others, because we cannot rebalance the load between them. Also, if we later want to change the number of shards, or just change the distribution algorithm, that will be a difficult task, because a lot of data will have to be re-distributed among the new shards.
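A sketch of fixed, hash-based shard assignment (PostgreSQL's hashtext function is used here purely for illustration; any stable hash would do, and the table is hypothetical):

```sql
-- Map each user id to one of 4 shards by hashing it
SELECT user_id,
       abs(hashtext(user_id::text)) % 4 AS shard_no
FROM orders;

-- Changing the shard count from 4 to 5 reassigns most keys,
-- which is exactly why resharding under fixed distribution is painful
```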
The other approach is dynamic distribution. Here, the exact location of each row is stored in a separate lookup table, so every time we want a particular row, we first ask that table where it is located.
This, of course, generally reduces performance, but it makes it possible to dynamically move items between shards, balancing the load on each of them. It also lets us distribute the load between shards unevenly, which can be useful if our servers differ in capacity.