DBs and SQL Flashcards
What is a SQL database?
- Structured Data: SQL databases, also known as relational databases, store data in structured tables with rows and columns, where each row represents a record and each column represents a field.
- Schema: SQL databases have a fixed schema, which defines the structure of the tables, including the data types and constraints for each column.
- ACID Transactions: SQL databases support ACID (Atomicity, Consistency, Isolation, Durability) transactions, ensuring data integrity and reliability.
- Query Language: SQL databases use the Structured Query Language (SQL) to interact with the data, allowing for complex queries, joins, and aggregations.
- Examples: Popular SQL databases include MySQL, PostgreSQL, Oracle Database, and Microsoft SQL Server.
What is a NoSQL database?
- Flexible Schema: NoSQL databases do not require a fixed schema, allowing for more flexibility in data storage and accommodating diverse data structures.
- Variety of Data Models: NoSQL databases can use various data models, including document, key-value, column-family, and graph.
- Scalability: NoSQL databases are designed for horizontal scalability, making them suitable for handling large volumes of data across multiple servers.
- CAP Theorem: NoSQL databases often prioritize availability and partition tolerance over consistency, as described by the CAP theorem.
- Examples: Popular NoSQL databases include MongoDB (document), Redis (key-value), Cassandra (column-family), and Neo4j (graph).
When should you use a SQL vs a NoSQL database?
- Structured Data: Use SQL databases when dealing with structured data that fits well into tables with a fixed schema.
- Complex Queries: Choose SQL databases when you need to perform complex queries, joins, and aggregations on your data.
- Data Integrity: Opt for SQL databases when data integrity and ACID transactions are crucial for your application.
- Flexible Schema: Use NoSQL databases when you have diverse or rapidly changing data structures that don’t fit well into a fixed schema.
- Scalability: Choose NoSQL databases when you need to handle large volumes of data and require horizontal scalability.
Additional interview questions around these topics:
How do you handle data consistency in a NoSQL database? What are the trade-offs between consistency, availability, and partition tolerance in the CAP theorem? How do you design a schema for a relational database? What are the advantages and disadvantages of using a document-based NoSQL database? How do you handle data normalization and denormalization in a SQL database? What are some strategies for optimizing query performance in a SQL database? How do you handle data migration when transitioning from a SQL to a NoSQL database or vice versa?
How do you design a schema for a relational database?
- Identify Entities: Determine the main entities (objects or concepts) that need to be represented in the database, such as customers, orders, or products.
- Define Relationships: Establish the relationships between entities, including one-to-one, one-to-many, and many-to-many relationships.
- Normalize Data: Organizing data into separate tables to minimize redundancy and ensure data integrity based on logical relationships.
- Choose Primary and Foreign Keys: Select primary keys to uniquely identify records in each table and use foreign keys to establish relationships between tables.
- Determine Data Types: Define the appropriate data types for each column, such as integer, varchar, or date, considering the nature of the data and any constraints.
- Consider Indexes: Create indexes on columns that are frequently used in queries to improve query performance.
- Implement Constraints: Apply constraints like NOT NULL, UNIQUE, or CHECK to enforce data integrity and business rules.
What are the advantages and disadvantages of using a document-based NoSQL database?
Advantages:
- Flexible Schema: Accommodating diverse data structures and changes over time.
- Scalability: Designed for horizontal scalability. Suitable for handling large volumes of data.
- Ease of Use: Often use JSON-like formats
- Aggregation: Many offer powerful aggregation frameworks for complex data processing and transformation.
Disadvantages:
- Lack of Joins: Typically do not support joins. Harder to query data across multiple collections.
- Data Redundancy: Flexible schema can lead to data redundancy and inconsistency, with the same data may be stored in multiple documents.
- Complex Transactions: Handling ACID transactions can be more complex compared to relational databases.
How do you handle data normalization and denormalization in a SQL database (and what is it)?
Normalization:
- Normalization = organizing data into separate tables to minimize data redundancy and ensure data integrity based on logical relationships.
- Functional Dependencies: Identify functional dependencies between columns and ensure that each table represents a single concept.
- Normal Forms: Apply normalization rules to achieve various normal forms (1NF, 2NF, 3NF, etc.), each with specific criteria for data organization.
Denormalization:
- Denormalization = the process of introducing redundancy into a database by merging tables or adding redundant data to improve query performance
- Performance Trade-offs: Use denormalization to optimize specific queries, but be aware of the trade-offs, e.g. increased storage and complexity in data maintenance.
- Materialized Views: Consider using materialized views to store precomputed query results, providing a balance between normalization and denormalization.
What are some strategies for optimizing query performance in a SQL database?
- Use Indexes: on columns that are frequently used in queries, especially WHERE clauses and JOIN conditions.
- Optimize Joins: Minimize the number of joins in queries and avoid using cross joins. Use indexes on join columns to improve performance.
- Limit Result Sets: Use the LIMIT clause to restrict the number of rows returned, reducing the amount of data transferred and processed.
- Analyze Query Plans: Use query execution plans to analyze query performance and identify bottlenecks, such as full table scans or inefficient joins.
- Optimize Data Types: Choose appropriate data types for columns, considering the nature of the data and any constraints. Smaller data types can improve query performance.
- Denormalize Data: Consider denormalizing data or using materialized views to improve query performance, especially for complex aggregations or calculations.
- Cache Results: store frequently accessed query results, reducing the need for repeated database queries.
- Test and Monitor: Test to ensure it meets performance requirements and continue monitoring e.g. with Datadog if query is used by an endpoint to identify any further performance issues.
What is indexing, when should you use it and what are the pros and cons?
What is Indexing?
- Creating a data structure (an index) to improve the speed of data retrieval operations
- Used to quickly locate data without having to search every row in a database table
- Can be created using one or more columns of a database table, providing a pointer to the data in those columns
- Balanced-tree indexes (most common - efficient searching, insertion and deletion), or things like hash tables (good for searching for exact matches), bitmap indexes, etc
- Beneficial for frequent reads, large data sets, WHERE clauses, JOIN operations, and ORDER BY/GROUP BY operations.
When Should You Use It?
- Frequent Read Operations that require fast data retrieval.
- Large Data Sets: particularly beneficial for large data sets where scanning the entire table would be time-consuming.
- WHERE Clauses: columns that are often used in WHERE clauses to filter data
- JOIN Operations: columns that are used in JOINs to improve join performance
- Ordering and Grouping: columns that are frequently used for sorting (ORDER BY) or grouping (GROUP BY) data.
Pros of Indexing:
- Faster Data Retrieval
- Efficient Data Filtering i.e WHERE clauses
- Optimized Joins
- Improved Sorting and Grouping
Cons of Indexing:
- Slower Write Operations: they can slow down write operations (INSERT, UPDATE, DELETE) because the index needs to be updated
- Increased Storage Requirements
- Maintenance Overhead: require maintenance, such as rebuilding or reorganizing, to maintain their performance and minimize space usage
- Added complexity
- Over-Indexing: too many indexes can lead to performance issues. Find a balance between the number of indexes and the performance benefits provided
What is database caching? What are the pros and cons?
- Store frequently accessed data in a temporary storage area (cache) so future requests can be served more quickly.
- For databases, caching can be applied at various levels, including query result caching, object caching, or full-page caching
- Cache can be implemented in-memory or on disk, and it can be managed by the database system, the application, or a dedicated caching system.
Overview of Database Caching:
- Query Result Caching: The results of frequently executed queries are stored in the cache
- Object Caching: Database objects, such as rows or documents, are cached in the application layer
- Full-Page Caching: Entire pages or views generated from database queries are cached, reducing the need for repeated database queries and page rendering
- Dedicated Caching Systems: e.g. Redis and Memcached
Pros of Database Caching:
- Improved Performance: reduces the need for repeated database queries
- Reduced Database Load: the load on the database is reduced, freeing up resources for other operations
- Enhanced User Experience: more responsive
- Cost Savings: Reduced DB load, less DB scaling and additional hardware.
Cons of Database Caching:
- Data Inconsistency: Cached data can become outdated or inconsistent
- Cache Management Complexity
- Memory Usage: In-memory caching consumes additional memory resources
- Cache Invalidation: Deciding when and how to invalidate or update the cache can be challenging, especially in applications with frequent data updates.
- Initial Cache Miss Penalty
What are Constraints in SQL?
Constraints are used to specify the rules concerning data in the table. It can be applied for single or multiple fields in an SQL table during the creation of the table or after creating using the ALTER TABLE command. The constraints are:
NOT NULL - Restricts NULL value from being inserted into a column. CHECK - Verifies that all values in a field satisfy a condition. DEFAULT - Automatically assigns a default value if no value has been specified for the field. UNIQUE - Ensures unique values to be inserted into the field. INDEX - Indexes a field providing faster retrieval of records. PRIMARY KEY - Uniquely identifies each record in a table. FOREIGN KEY - Ensures referential integrity for a record in another table.
What is a Join? List its different types.
The SQL Join clause is used to combine records (rows) from two or more tables in a SQL database based on a related column between the two.
There are four different types of JOINs in SQL:
(INNER) JOIN: Retrieves records that have matching values in both tables involved in the join. This is the widely used join for queries.
LEFT (OUTER) JOIN: Retrieves all the records/rows from the left and the matched records/rows from the right table.
RIGHT (OUTER) JOIN: Retrieves all the records/rows from the right and the matched records/rows from the left table.
FULL (OUTER) JOIN: Retrieves all the records where there is a match in either the left or right table.
What is an Index? Explain its different types.
A database index is a data structure that provides a quick lookup of data in a column or columns of a table. It enhances the speed of operations accessing data from a database table at the cost of additional writes and memory to maintain the index data structure.
CREATE INDEX index_name /* Create Index /
ON table_name (column_1, column_2);
DROP INDEX index_name; / Drop Index */
There are different types of indexes that can be created for different purposes:
Unique and Non-Unique Index:
Unique indexes are indexes that help maintain data integrity by ensuring that no two rows of data in a table have identical key values. Once a unique index has been defined for a table, uniqueness is enforced whenever keys are added or changed within the index.
CREATE UNIQUE INDEX myIndex
ON students (enroll_no);
Non-unique indexes, on the other hand, are not used to enforce constraints on the tables with which they are associated. Instead, non-unique indexes are used solely to improve query performance by maintaining a sorted order of data values that are used frequently.
Clustered and Non-Clustered Index:
Clustered indexes are indexes whose order of the rows in the database corresponds to the order of the rows in the index. This is why only one clustered index can exist in a given table, whereas, multiple non-clustered indexes can exist in the table.
The only difference between clustered and non-clustered indexes is that the database manager attempts to keep the data in the database in the same order as the corresponding keys appear in the clustered index.
Clustering indexes can improve the performance of most query operations because they provide a linear-access path to data stored in the database.
What is Cursor? How to use a Cursor?
A database cursor is a control structure that allows for the traversal of records in a database. Cursors, in addition, facilitates processing after traversal, such as retrieval, addition, and deletion of database records. They can be viewed as a pointer to one row in a set of rows.
Working with SQL Cursor:
DECLARE a cursor after any variable declaration. The cursor declaration must always be associated with a SELECT Statement. Open cursor to initialize the result set. The OPEN statement must be called before fetching rows from the result set. FETCH statement to retrieve and move to the next row in the result set. Call the CLOSE statement to deactivate the cursor. Finally use the DEALLOCATE statement to delete the cursor definition and release the associated resources.
List the different types of relationships in SQL.
One-to-One - This can be defined as the relationship between two tables where each record in one table is associated with the maximum of one record in the other table.
One-to-Many & Many-to-One - This is the most commonly used relationship where a record in a table is associated with multiple records in the other table.
Many-to-Many - This is used in cases when multiple instances on both sides are needed for defining a relationship.
Self-Referencing Relationships - This is used when a table needs to define a relationship with itself.
Can you explain the architecture of PostgreSQL?
The architecture of PostgreSQL follows the client-server model.
The server side comprises of background process manager, query processer, utilities and shared memory space which work together to build PostgreSQL’s instance that has access to the data. The client application does the task of connecting to this instance and requests data processing to the services. The client can either be GUI (Graphical User Interface) or a web application. The most commonly used client for PostgreSQL is pgAdmin.