Systems Design Flashcards

1
Q

Be able to talk about trade-offs for your design

A

yeah

2
Q

difference between strong vs eventual consistency, when would you use either

A

https://hackernoon.com/eventual-vs-strong-consistency-in-distributed-databases-282fdad37cf7

strong consistency means that after a write request to a db, all nodes of that db are updated before any subsequent read request can be processed.
You would want this in a case where information must be accurate on every read; ticket reservations are a good use case.
Trade-offs: this increases latency, because every write request requires all nodes to be updated before reads can proceed.

eventual consistency is when the data will eventually become identical across all nodes. If I make a write request to one node, the other nodes can still be read from, and those reads may return stale data.
Banks (checks, chargebacks, refunds), online store inventory, and social media all use eventual consistency to improve performance.
You want this if you need very low latency and the data does not always need to be up to the moment: likes on an Instagram post, certain banking transactions
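The difference can be sketched with a toy in-memory "cluster" (not a real database; all names here are invented for illustration). A strong write updates every node before returning; an eventual write acks after one node and replicates in the background, so other replicas can serve stale reads in the meantime:

```python
# Toy sketch contrasting strong vs eventual consistency.
from collections import deque

class Cluster:
    def __init__(self, num_nodes):
        self.nodes = [{} for _ in range(num_nodes)]
        self.pending = deque()          # replication backlog for eventual mode

    def write_strong(self, key, value):
        # Strong: every node is updated before the write returns.
        for node in self.nodes:
            node[key] = value

    def write_eventual(self, key, value):
        # Eventual: ack after one node; replicate to the rest "later".
        self.nodes[0][key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Background replication catching the other nodes up.
        while self.pending:
            key, value = self.pending.popleft()
            for node in self.nodes[1:]:
                node[key] = value

    def read(self, node_index, key):
        return self.nodes[node_index].get(key)

cluster = Cluster(3)
cluster.write_eventual("seat_42", "reserved")
print(cluster.read(1, "seat_42"))   # None -- replica 1 is still stale
cluster.replicate()
print(cluster.read(1, "seat_42"))   # "reserved" -- eventually consistent
```

A strong write (`write_strong`) would never expose the stale `None` read, at the cost of touching every node before returning.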

3
Q

Load balancing, what is it? How would you achieve load balancing a url hosted by many machines with different IP’s?

A

Load balancing is the act of distributing communication or work across multiple pieces of hardware

For a DNS/URL setup, when a user resolves a name like google.com, DNS returns the load balancer's address, and the load balancer decides which backend machine to route the traffic to. This lets us expose only 1 public IPv4 address (which is hard to come by) and keeps our machines hidden from online attackers

How do we decide how to route the traffic once the LB gets it? There are many strategies. One is round-robin: just go in a circle so every server gets equal traffic. A better way is to route traffic to the server with the least current load, based on some metric. The right metric differs between systems: reading large files is heavy on RAM, while computing large numbers needs more CPU time
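Both strategies above are a few lines each; this sketch uses invented server addresses and load numbers for illustration:

```python
# Minimal sketch of two load-balancing strategies.
from itertools import cycle

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# Round-robin: every server gets an equal share, in order.
rr = cycle(servers)
def pick_round_robin():
    return next(rr)

# Least-load: route to whichever server currently reports the
# smallest value of some metric (active connections, CPU, RAM...).
current_load = {"10.0.0.1": 7, "10.0.0.2": 2, "10.0.0.3": 5}
def pick_least_loaded():
    return min(current_load, key=current_load.get)

print([pick_round_robin() for _ in range(4)])
# ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.1']
print(pick_least_loaded())   # '10.0.0.2'
```

A real load balancer would also update `current_load` as connections open and close; here it is a static snapshot.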

4
Q

cpu vs io bound operations, what is the difference, examples of each

A

https://www.brainscape.com/flashcards/leetcode-heaps-10231570/packs/18112311

cpu bound: a thread of execution is CPU bound if its performance is determined by the CPU
This means a CPU-bound process will speed up or slow down with the strength of the CPU

IO bound: a thread of execution is IO bound if its performance is determined by a subsystem or peripheral that the CPU does not control
This includes reading/writing to the hard drive or waiting for responses from the network. If you are writing to an SSD vs a spinning HD, the SSD will be faster, and the CPU will have little to no effect on that performance. If you are calling an external API and the API is slow, there is nothing you can do to speed up the external server's response; you can only mitigate it locally with caching and similar methods
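One practical consequence: threads help IO-bound work because the waits can overlap, but they do not help purely CPU-bound work (and in CPython the GIL serializes CPU-bound threads anyway). A rough sketch, using `time.sleep` as a stand-in for a disk or network wait; the exact timings are illustrative:

```python
# Threads overlap IO waits: 4 x 0.2s waits take ~0.8s sequentially
# but ~0.2s when run concurrently.
import time
from concurrent.futures import ThreadPoolExecutor

def io_task():
    time.sleep(0.2)        # stand-in for a disk or network wait

start = time.perf_counter()
for _ in range(4):
    io_task()              # sequential: ~0.8 s of pure waiting
sequential = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda _: io_task(), range(4)))
threaded = time.perf_counter() - start   # ~0.2 s: the waits overlap

print(f"sequential {sequential:.2f}s, threaded {threaded:.2f}s")
```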

5
Q

How do you determine if a system is IO bound vs CPU bound?

A

By monitoring the task manager (or top): a CPU-bound program will show high CPU time and low idle time, whereas an IO-bound program will show high idle time and almost no CPU time, because it spends most of its life waiting

6
Q

If you were designing a scheduler, how would you give different priority to different programs if you knew each program coming in was more CPU or IO bound?

A

https://www.geeksforgeeks.org/cpu-scheduling-in-operating-systems/

Operating systems have these goals in mind:
Max CPU utilization [keep the CPU as busy as possible]
Fair allocation of CPU
Max throughput [number of processes that complete their execution per time unit]
Min turnaround time [time taken by a process to finish execution]
Min waiting time [time a process waits in the ready queue]
Min response time [time until a process produces its first response]

A question that helps answer this one: why is it important for the scheduler to distinguish I/O-bound programs from CPU-bound programs?
Answer: I/O-bound programs perform only a small amount of computation before performing IO, so they typically do not use up their entire CPU quantum. CPU-bound programs, on the other hand, use their entire quantum without performing any blocking IO operations. Consequently, one can make better use of the computer's resources by giving higher priority to I/O-bound programs and allowing them to execute ahead of the CPU-bound programs
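A hypothetical sketch of that policy (names and the two-level priority are invented for illustration): a thread that yielded its last time slice early, i.e. blocked on IO, gets scheduled ahead of one that burned its whole quantum:

```python
# Toy priority scheduler that favors IO-bound threads.
import heapq

class Scheduler:
    def __init__(self):
        self.ready = []        # min-heap of (priority, seq, name)
        self.seq = 0           # tie-breaker to keep FIFO order

    def enqueue(self, name, used_full_quantum):
        # 0 = high priority (IO-bound), 1 = low priority (CPU-bound)
        priority = 1 if used_full_quantum else 0
        heapq.heappush(self.ready, (priority, self.seq, name))
        self.seq += 1

    def next_thread(self):
        return heapq.heappop(self.ready)[2]

sched = Scheduler()
sched.enqueue("video_encoder", used_full_quantum=True)    # CPU bound
sched.enqueue("disk_logger", used_full_quantum=False)     # IO bound
sched.enqueue("web_request", used_full_quantum=False)     # IO bound
print(sched.next_thread())   # disk_logger
print(sched.next_thread())   # web_request
print(sched.next_thread())   # video_encoder
```

Real schedulers (e.g. Linux CFS) use far richer accounting, but the core idea — observed behavior feeds back into priority — is the same.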

7
Q

If you had a CPU that could handle 16 threads and you knew all programs run on it would be 100% CPU bound, nothing is IO bound, how many execution threads would you let your system run?

A

Sixteen, one per hardware thread. Since every program is 100% CPU bound, no thread ever blocks, so running more than 16 threads gains nothing: the extra context switching takes up CPU time and decreases performance, while running fewer leaves hardware threads idle

8
Q

How does a scheduler determine if a thread is more IO or CPU bound?

A

https://unix.stackexchange.com/questions/254864/how-does-the-linux-scheduler-determine-if-a-process-is-i-o-bound-or-cpu-bound

A scheduler knows this by looking at whether the thread released its CPU time slice early or used all of it. A process that uses its entire time slice is likely CPU bound; a process that releases its time slice early is probably waiting on an IO operation, hence IO bound

9
Q

You are given 100 big objects that are unrelated to each other. Each requires you to perform a lot of complicated math, then store it on your computer in different locations. You have a dual-core processor that can handle 8 threads. How do you design the program?

A

I would have 1 thread per core perform the CPU-bound math (2 threads), and the remaining threads handle storing the results in the different locations on disk, since storing is IO bound. If the math tasks are not 100% CPU bound, the split could shift, for example to an even split of math and storage threads per core
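A sketch of that pipeline with two thread pools: a small one for the math stage and a larger one for the IO stage. (Caveat: in CPython the GIL limits CPU parallelism across threads, so real code might use a process pool for the math stage; the pool sizes, object count, and paths below are all illustrative.)

```python
# Math-then-store pipeline: small pool for CPU work, big pool for IO.
from concurrent.futures import ThreadPoolExecutor

objects = list(range(10))              # stand-ins for the big objects
store = {}                             # stand-in for "different locations"

def complicated_math(x):
    return x * x + 1                   # placeholder for the heavy math

def store_result(x, result):
    store[f"/data/location_{x}"] = result   # placeholder for a disk write

with ThreadPoolExecutor(max_workers=2) as math_pool, \
     ThreadPoolExecutor(max_workers=6) as io_pool:
    # Submit all math tasks, then hand each finished result to the IO pool.
    futures = {x: math_pool.submit(complicated_math, x) for x in objects}
    for x, fut in futures.items():
        io_pool.submit(store_result, x, fut.result())

print(len(store))                    # 10
print(store["/data/location_3"])     # 10
```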

10
Q

How would you design an efficient database?

A

IDFK!!!

11
Q

When you type in a url into the browser, how is that converted to an IP? if your DNS routes to multiple servers and is horizontally scaled, what problems could that cause? What services could you use to provide load balancing? What do we call the act of preserving a user session across multiple servers when we scale? Talk about maintain redundancy while horizontally scaling

A

A DNS server converts a website name into an IP address. When you look up the URL, you are typically given a list of servers that you specifically are going to use, and those answers are cached so you do not have to repeatedly resolve the server IPs.

This can cause a problem: if your session data for a website is stored locally on one server, then when DNS gives you a different server you no longer have your session data and you have to log in again. At worst, if you were shopping across 3 different servers, each server holds different parts of your cart with no way to combine them when you check out.

A solution would be either to route a specific user to the same server each time, or to move user data off the individual servers and onto shared external storage so all user data is consistent across all servers.

Introducing a single shared hard drive just causes a different problem: we started with 1 server, added redundancy to reduce issues, and now we are back to having only 1 hard drive. To solve this we would use RAID, which (in its mirroring configurations) copies storage across multiple hard drives.

Routing a user back to the same server each time so they keep their session data and do not have to log in again is what we call sticky sessions.

You can use software as a load balancer (AWS Elastic Load Balancing, HAProxy, and other open-source options) or you can purchase hardware from companies that performs the task

12
Q

How would you add redundancy to a database?

A

You would add database replication. In a master/slave relation, anything written to the master is copied to all the slaves, so they are all identical copies.
This is good for a read-heavy site because the master or any of the slaves can serve read queries, which spreads that load heavily; writes still have to go to the master node
To follow this up, you can run a master-master(-slave) relation with multiple masters for added redundancy; when one master gets a write it is copied to the other master

13
Q

Talk about caching, different mechanisms to achieve it, different use cases

A

Regarding sticky sessions and maintaining them: a good way to ensure the user goes back to the same server is to use cookies. Encode a string in the cookie; the load balancer decodes it and sends the request to the designated server. This obfuscates the server's private IP, and if the server IP changes, only the load balancer's decoding table needs updating.

Craigslist serves its posts as static HTML files instead of dynamically generating them like many other sites. Reading bytes from disk is very cheap, so the site loads very efficiently. The problem: if they have 100k posts and want to change anything about the site, they have to change it in all 100k files, versus changing 1 template for a dynamic site. This can be considered server-side caching, because the post is written once and saved, and everyone else reads that exact same cached HTML file.

Redis vs Memcached
https://stackoverflow.com/questions/10558465/memcached-vs-redis
Both are in-memory data stores that work as caches and can help speed up databases or anything that might be expensive to repeatedly generate.

Redis is more popular and more powerful because it offers more features than Memcached. Memcached only supports string values, while Redis supports many different data types and has cluster support and persistence across restarts.

Memcached (short for memory cache) sits one tier above a MySQL database: hot data lives in RAM, looked up by a primary or unique key. Memcached is an always-running server that stores data in RAM as key-value pairs. When a key is requested and not in the cache, it is fetched and added. Once the cache is full, it evicts by LRU (least recently used): the entry that has gone longest without being used is removed and replaced with the new one.
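An LRU eviction policy in the spirit of Memcached's fits in a few lines with the standard library's `OrderedDict`; the capacity and keys below are illustrative:

```python
# Minimal LRU cache: capacity-bounded, evicts the least recently used.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                    # cache miss
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=2)
cache.set("user:1", "alice")
cache.set("user:2", "bob")
cache.get("user:1")            # touch user:1 so it is most recent
cache.set("user:3", "carol")   # capacity hit: evicts user:2, not user:1
print(cache.get("user:2"))     # None -- evicted
print(cache.get("user:1"))     # alice
```

A real cache would add TTLs and handle concurrency, but the eviction order is the heart of it.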

14
Q

what is database partitioning?

A

This is when you partition a database so that certain data lives on different servers. Facebook did this early on when they put MIT and Harvard users on different servers because their 1 server was not large enough. You can also partition users based on last name or some other data point

15
Q

what does high availability mean in the context of a resource? (database, load balancer, server, etc.)

A

https://docs.oracle.com/cd/B19306_01/server.102/b14210/overview.htm

HA or high availability means that each resource is constantly checking the status of the other ones in the cluster to make sure they are available; if one of them fails, it is quickly detected and remedied

Four characteristics of a HA system are:

  1. Reliability: they are robust and do not easily fail
  2. Recoverability: When a failure occurs, what do you do? You should know the errors that can occur ahead of time and plan for those errors accordingly so you can act fast. If a critical database table is deleted or access is lost, how do you plan to recover the table and give availability back to your customers?
  3. Timely error detection: If a component fails, the time it takes to resolve the failure includes detection + resolution time. Reducing the detection time is key in achieving a highly available system because the sooner you detect it, the sooner you solve it
  4. Continuous operations: Users are always able to access your systems and there is little to no downtime to perform maintenance or updates. Actions such as moving database tables to different locations (physical or otherwise), updating tables, adding cpu power or other hardware, should all be transparent to the user
16
Q

List types of traffic and the respective ports that coincide with them

A

HTTP (hypertext transfer protocol): port 80
HTTPS (HTTP over SSL/TLS, secure socket layer): port 443
SSH (secure shell): port 22
(All of these run on top of TCP, the transmission control protocol; TCP itself is the transport, not a port.)

17
Q

When should an operation be asynchronous vs synchronous?

A

idfk

18
Q

Clarifying questions to ask when you begin a systems design interview

A

Who is going to use this? (students, professionals, children)
How are they going to use this? (app, web, on the go, etc)
How many users do we expect to handle at any given time?
What is the goal of the system?
What are the inputs/outputs of the system?
How much data will we have to handle per user?
How many requests per second do we need to handle?
What is the expected read:write ratio?

19
Q

Name some hashing algorithms, explain their uses and such

A

A major use of hash functions is verifying file transfers: if I send a file from A to B, the receiver can hash the received file and compare it against the original's hash to make sure there has been no tampering.

(Public and private keys belong to asymmetric cryptography, not hashing itself: in a digital signature, the private key signs a hash of the file and the public key verifies it, confirming the file's integrity and origin.)

Hash algorithms need to be quick enough to be practical, but it must be infeasible to work backwards through the algorithm and craft an input with a targeted hash value; if you can hit a target hash value, you can create collisions.

Hash collisions are when 2 different files evaluate to the same hash value. This is bad for security because you can no longer be sure of the integrity of the file.

Hash algorithms are supposed to protect against this with the "avalanche" effect: when you change 1 bit of the original file, the hashed value is completely different.

MD5 is a hash algorithm that takes in data and hashes it to a 128-bit digest that ideally is unique and cannot be reproduced from a different input. MD5 today is broken and can be tampered with.

SHA-2 and SHA-3 are the hash families that are currently safe to use. (SHA-1 is also broken: practical collisions were demonstrated in 2017, so treat it as deprecated.)
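Integrity checks and the avalanche effect are easy to see with Python's standard `hashlib` (here SHA-256, from the SHA-2 family; the byte strings are made up):

```python
# Hashing for integrity checks, plus the avalanche effect.
import hashlib

original = b"transfer me"
tampered = b"transfer mf"     # a single byte changed

h1 = hashlib.sha256(original).hexdigest()
h2 = hashlib.sha256(original).hexdigest()
h3 = hashlib.sha256(tampered).hexdigest()

print(h1 == h2)   # True  -- same input, same digest: transfer verified
print(h1 == h3)   # False -- any tampering changes the digest

# Avalanche effect: count differing hex characters between the digests.
diff = sum(a != b for a, b in zip(h1, h3))
print(f"{diff}/64 hex chars differ")   # roughly 60 of 64 differ
```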

20
Q

ACID-compliant databases

A

Atomic: a transaction in the database is all or nothing; if part of the write fails, the entire transaction fails, so you never get partially applied writes
Consistent: every transaction takes the database from one valid state to another, so transactions that violate the defined rules are rejected, and 2 people querying unchanged data will get the same result
Isolated: multiple transactions can be processed at once as long as they do not affect one another
Durable: when your technology fails, such as a power outage or software crash, committed data stays intact. We do not lose data when failures happen

21
Q

latency vs throughput

A

https://community.cadence.com/cadence_blogs_8/b/sd/posts/understanding-latency-vs-throughput

latency is the time required to complete a single action: if I click something and it takes 5 seconds to respond, the latency is 5 seconds

throughput is the number of actions executed in a given time. If each operation takes 5 seconds, but we complete 500 per second by processing many in parallel, the throughput is 500/second

22
Q

availability vs consistency, CAP theorem

A

http://ksat.me/a-plain-english-introduction-to-cap-theorem

Consistency: every read has the most recent write or returns an error that it cannot get the most recent write
Availability: every request receives a response, but there is no guarantee it is the most up to date information
Partition Tolerance: the system continues to operate if you have partial network failures

CAP theorem means we can really only guarantee 2 of these. Since networks can always partition, partition tolerance is mandatory, so the real choice is between CP and AP

CP: each read sees the most recent write, but reads might block behind a write (or fail). This is good when a read cannot be wrong. Strong consistency

AP: each request gets a fast response but not always the most accurate data; writes take time to propagate through all the different database nodes. This is eventual consistency

23
Q

weak, eventual, and strong consistency

A

weak consistency: when you have a write, reads may or may not see it. A great example is voice calls, if you lose connection and you are both trying to talk, when you regain connection you do not hear everything you both said

eventual consistency: after a write, each read will eventually see the new data. You can see this approach with emails. You may not immediately get the email, but they will show up in order eventually

strong consistency: after a write, all reads will see it, all database nodes are updated synchronously. This works well with transaction systems

24
Q

availability patterns, name them (they are explained in detail elsewhere)

A

Fail-over: when a master service fails, the slave picks up, or a master-master pair shares the service
replication: data is replicated across master-slave or master-master
availability in parallel or sequence: if 2 services are in sequence, their failure rates compound and overall availability drops; if they are in parallel, they improve each other's availability

25
Q

Talk about fail-over patterns

A
  1. active-passive (master-slave) fail-over: heartbeats are sent between the systems; when the heartbeat is interrupted, the passive server takes over the IP and resumes service. The passive system may be running on hot stand-by or may need a cold start, so downtime varies accordingly
  2. active-active (master-master) fail-over: both servers are active and receiving traffic. When one fails, the other absorbs the full traffic load until it comes back. Both servers need a DNS entry if they are public-facing; if they are internal, the application software must know about both

The problems with fail-over: data can be lost if the master fails before the slave receives its latest writes (true for master-master as well), and it adds a lot of complexity and hardware requirements.

26
Q

Talk about database replication

A

master-master: two servers that can both be written to; a background process copies new writes across so they remain eventually consistent
master-slave: one write server and one or more read replicas. The master can be written to or read from; the slaves can only be read from. The master replicates writes to the slaves, which stay eventually consistent with it

cons:
potential data loss if a master fails before its writes reach the slaves
if it is master-slave and we are getting lots of writes, replaying those writes will bog down the slaves, causing massive replication lag
more complexity

27
Q

availability in parallel or in sequence

A

availability is described in nines: 3 9's is 99.9% available, 4 9's is 99.99% available

Availability in parallel vs in sequence
If a service consists of multiple components prone to failure, the service’s overall availability depends on whether the components are in sequence or in parallel.

In sequence
Overall availability decreases when two components with availability < 100% are in sequence:

Availability (Total) = Availability (Foo) * Availability (Bar)
If both Foo and Bar each had 99.9% availability, their total availability in sequence would be 99.8%.

In parallel
Overall availability increases when two components with availability < 100% are in parallel:

Availability (Total) = 1 - (1 - Availability (Foo)) * (1 - Availability (Bar))
If both Foo and Bar each had 99.9% availability, their total availability in parallel would be 99.9999%.
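The two formulas above can be checked in a few lines, using the same Foo = Bar = 99.9% figures:

```python
# Worked version of the sequence and parallel availability formulas.
def sequence_availability(*components):
    # In sequence every component must be up: multiply availabilities.
    total = 1.0
    for a in components:
        total *= a
    return total

def parallel_availability(*components):
    # In parallel the service is down only if ALL components are down.
    down = 1.0
    for a in components:
        down *= (1.0 - a)
    return 1.0 - down

foo = bar = 0.999
print(f"{sequence_availability(foo, bar):.4%}")   # 99.8001%
print(f"{parallel_availability(foo, bar):.4%}")   # 99.9999%
```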

28
Q

What is a CDN?

A

Content delivery network

This is used as a globally distributed network of proxy servers that can serve content from different locations to users around the globe. You would typically use this for html/css/js files, photos, or videos. The users get data from the CDN closest to them and your servers do not have to manage every request from the user!

Push CDN (minimize traffic, maximize storage): the CDNs are updated every time the origin server has new data. This requires the CDN to have enough storage to hold all of the data at once. Good for smaller services that do not hold a ton of data. You can set a TTL to determine when data expires and fresh content is pushed out to each CDN

Pull CDN (minimize storage, maximize traffic): these CDNs fetch an item the first time a user requests it. Latency is high until the CDN has cached the popular content. You can set a TTL on cache items so unused ones expire, which helps minimize storage. Sites with heavy traffic work very well with pull CDNs because the cache stays up to date and unused items are evicted

Disadvantage(s): CDN
CDN costs could be significant depending on traffic, although this should be weighed against the additional costs you would incur not using a CDN.
Content might be stale if it is updated before the TTL expires it.
CDNs require changing the URLs for static content (HTML/CSS/JS/photos) to point at the CDN rather than your own server.

29
Q

Big data architecture, what are some aspects of it? Examples of solutions for them too

https://www.quastor.org/p/an-introduction-to-big-data-architectures

A

Data Sources - It’s pretty dumb to build a distributed system for managing data if you don’t have any data. So, all architectures will have at least one source that’s creating data. This can be a web application, an IoT device, etc.

Data Storage - Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake. (AWS s3, cloud storage)

Batch Processing - Because the data sets are so large, often a big data solution must process data files using long-running batch jobs to filter, aggregate, and prepare the data for analysis. Map Reduce is a very popular way of running these batch jobs on distributed data. (Map Reduce)

Real-time Message Ingestion - If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing. This portion of a streaming architecture is often referred to as stream buffering. (kafka is an example, this process is mainly event-driven)

Stream Processing - After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. Apache Storm is a popular open source technology that can handle this for you. (apache kafka streams helps process the data streams from a kafka source)

Analytical Data Store - Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. This can be done with a relational data warehouse (commonly used for BI solutions) or through NoSQL technology like HBase or Hive. (postgres, mysql, NoSQL servers)

Orchestration - Big data solutions consist of repeated data processing operations that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. You can use something like Apache Oozie to orchestrate these jobs. (apache oozie)

30
Q

What problem does lambda architecture solve?

A

https://www.quastor.org/p/an-introduction-to-big-data-architectures

Lambda architecture solves the problem with querying big data. Big data often times has to be queried in batches, which means queries can take hours! This means your answer could be incorrect by the time it is computed.

The Lambda Architecture solves this by creating two paths for data flow: the cold path and the hot path.

The cold path is also known as the batch layer. It stores all of the historical data in raw form and performs batch processing on the data.

The raw data in the batch layer is immutable. The incoming data is always appended to the existing data, and the previous data is never overwritten.

The cold path has a high latency when answering analytical queries. This is because the batch layer aims at perfect accuracy by processing all available data when generating views.

The hot path is also known as the speed layer, and it analyzes the incoming data in real time. The speed layer’s views may not be as accurate or complete as the batch layer, but they’re available almost immediately after the data is received.

The speed layer is responsible for filling the gap caused by the batch layer’s lag and provides views for the most recent data.

The hot and cold paths converge at the serving layer. The serving layer indexes the batch view for efficient querying and incorporates incremental updates from the speed layer based on the most recent data.

With this solution, you can run analytical queries on your datasets and get up-to-date answers.

A drawback to the lambda architecture is the complexity. Processing logic appears in two different places (the hot and cold paths) and they use different frameworks.

The Kappa Architecture is meant to be a solution to this, where all your data flows through a single path, using a stream processing engine.

31
Q

explain map reduce

A

https://www.youtube.com/watch?v=43fqzaSH0CQ

map reduce has 3 phases: map, shuffle, reduce

example: how do I count the number of words in a list of documents

map:
1000 documents, 10 machines: each machine processes 100 documents and emits (word, count) pairs. The machines do not talk to each other during this phase; they only process what they are given

shuffle:
if 1000 different words have been counted in the map phase, the shuffle phase makes sure all pairs with the same key (here, the same word) end up on a single machine. With 10 machines, each machine receives 100 distinct keys to process in the reduce stage

reduce:
this stage combines all the partial results that were split up during the map phase into one final result per key
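A single-process sketch of the three phases for the word-count example (a real framework like Hadoop distributes these steps across machines; the document contents are invented):

```python
# Word count as map -> shuffle -> reduce, all in one process.
from collections import defaultdict

documents = ["the cat sat", "the dog sat", "the cat ran"]

# Map: each "machine" independently emits (word, 1) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group all pairs by key so each key lands in one place.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each key's values into one final result.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)   # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}
```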

32
Q

Load balance, layer 4 and layer 7 type

A

Layer 4 load balancing operates at the transport layer. It does not look into the payload at all; it only uses the source IP, destination IP, and ports in the headers to make any routing determination

Layer 7 load balancing looks at the application layer. This can involve the header, payload, cookies, or other items. Layer 7 involves receiving the incoming message, terminating the initial network connection, reading the message, then opening up a new network connection to the desired destination based on whatever parameters we decide to set.

34
Q

Load balancer pros and cons

A

Pros: hides backend servers, allows for horizontal scaling, lets us direct traffic where we want, avoids a single point of failure at the server tier, and can perform decryption (SSL termination) so the servers do not have to perform these potentially expensive operations
Cons: complexity; we need to avoid relying on local storage on any of the servers; the load balancer itself can become a bottleneck, or a single point of failure, if we do not give it enough resources

35
Q

reverse proxy, pros and cons

A

pros: increases security; the reverse proxy can decrypt messages (TLS termination), which saves compute power on the servers; it homogenizes responses because the internet only talks to the proxy server; it allows backend servers to change IPs without major issues
NGINX and HAProxy are solutions that can do reverse proxying and layer-7 load balancing
cons: complexity; if you have 1 reverse proxy server it is a single point of failure, so you need to plan fail-over patterns

36
Q

What is the OSI model?

A

OSI model is the Open Systems Interconnection model; it is a framework describing network functions in 7 layers

  1. Physical layer, the hardware: cables, radios, electrical signaling
  2. Data link layer, transfers frames between adjacent nodes on the network or between nodes on a LAN
  3. Network layer, routes packets between networks using logical (IP) addresses; it receives frames from the data link layer and delivers packets to the destination address
  4. Transport layer, manages delivery (TCP/UDP), error checking, and segmentation, and is responsible for transfer of data between systems
  5. Session layer, manages connections and conversations between different computers; the connection between computers happens at layer 5
  6. Presentation layer, translates data from the application and formats it (encoding, compression, encryption)
  7. Application layer, where the user and software interact directly with the network: web browser, desktop app
37
Q

What is server/client side discovery?

A

https://microservices.io/patterns/server-side-discovery.html

Server-side discovery is a pattern that allows a load balancer to connect to ephemeral endpoints. Since the services can be virtualized, their IPs/ports are not always the same; when a service starts up it registers with a service registry, and the load balancer consults the registry to route traffic.

https://microservices.io/patterns/client-side-discovery.html

Client-side discovery is the same idea from the client's perspective: the client queries the service registry itself and connects directly to one of the ephemeral service instances, rather than going through a server-side router

38
Q

Database Federation, what is it?

A

Database federation is when you break the database up by function, from monolithic to microservice-like: users, forums, and messages each in their own database. This reduces replication lag, reduces read/write traffic per database, reduces replication between masters/slaves, and improves the effectiveness of caching, because each database can hold more of its data in memory relative to its size

cons:
joins across the separate databases are very expensive and difficult
we need to keep track of which federation handles which reads/writes so we can use them properly
complexity
not effective if there are huge tables or complex query patterns, because poor cross-database join efficiency can end up causing worse lag

39
Q

Database sharding, what is it?

A

Sharding is when you distribute rows of the same data set across different databases based on some characteristic of the data. Users with last names A-M go to one shard, N-Z to another.
Similar to federation we get reduced read/write load, more cache capacity per shard, and less replication. Table indexes shrink too, which results in faster queries. If one shard goes down, the others keep working without issue, as long as they do not depend on the down shard

cons:
more hardware and software complexity
joins are difficult between shards
the application must know which shard to communicate with for each query, which can make it more complex
data distribution could be lopsided: there could be a lot of A-M users but few N-Z users. This can be mitigated with a good hashing algorithm on the shard key to distribute users evenly.
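A sketch of hash-based shard routing (the shard names are invented; production systems often use consistent hashing so that adding a shard moves less data):

```python
# Route a user to a shard via a stable hash of the shard key.
import hashlib

SHARDS = ["users_db_0", "users_db_1", "users_db_2"]

def shard_for(user_id: str) -> str:
    # A stable hash, not Python's randomized built-in hash(), so the
    # same user always maps to the same shard across processes and
    # restarts. (MD5 is fine here: we need distribution, not security.)
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("alice"))                               # same answer every run
print(shard_for("alice") == shard_for("alice"))         # True
print(len({shard_for(f"user{i}") for i in range(100)})) # 3: all shards used
```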

40
Q

Denormalization in a database, what is it? pros and cons

A

This is where we add duplicate data into the database to increase performance on expensive reads. This is done at the cost of increasing write complexity. In some systems, the read:write ratio can be 100:1 or even 1000:1.

Cons:
duplicate data, increased complexity in write statements, and writes can perform worse under a heavy write load since every copy must be updated
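The read/write trade-off can be shown with a tiny sqlite3 sketch (schema and names are made up for illustration): the posts table duplicates the author's name so the read path needs no join, but a rename must now update both tables.

```python
# Denormalization sketch: posts carries a copy of author_name.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("""CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    author_id INTEGER,
    author_name TEXT,   -- denormalized copy of users.name
    body TEXT)""")
db.execute("INSERT INTO users VALUES (1, 'ada')")
db.execute("INSERT INTO posts VALUES (1, 1, 'ada', 'hello')")

# Fast read path: no join needed to render a post with its author.
row = db.execute("SELECT author_name, body FROM posts WHERE id = 1").fetchone()
print(row)  # ('ada', 'hello')

# Costly write path: renaming a user touches every copy.
db.execute("UPDATE users SET name = 'ada2' WHERE id = 1")
db.execute("UPDATE posts SET author_name = 'ada2' WHERE author_id = 1")
```

With a 100:1 or 1000:1 read:write ratio, paying this extra cost on the rare write is usually worth it.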

41
Q

Constraints in a database, what are they?

A

Constraints are rules that restrict the data allowed in a table. We can require that data be of a certain type, have a foreign key relationship, be NOT NULL, be UNIQUE within a column, or fall back to a DEFAULT value when none is provided.
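The common constraints can all be shown in one small schema. A sketch using sqlite3 (table and column names are made up; note SQLite only enforces foreign keys when the `foreign_keys` pragma is on):

```python
# Constraint sketch: NOT NULL, UNIQUE, CHECK, DEFAULT, FOREIGN KEY.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.execute("""CREATE TABLE users (
    id     INTEGER PRIMARY KEY,
    email  TEXT NOT NULL UNIQUE,
    age    INTEGER CHECK (age >= 0),
    status TEXT DEFAULT 'active')""")
db.execute("""CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id))""")

db.execute("INSERT INTO users (id, email, age) VALUES (1, 'a@x.com', 30)")
print(db.execute("SELECT status FROM users WHERE id = 1").fetchone())
# ('active',) <- the DEFAULT filled in the missing status

try:
    # Violates the UNIQUE constraint on email.
    db.execute("INSERT INTO users (id, email, age) VALUES (2, 'a@x.com', 5)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # e.g. UNIQUE constraint failed: users.email
```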

42
Q

SQL tuning, what is it?

A

Much like profiling in Java, this is the same concept of drilling down into bottlenecks and optimizing.

You need a benchmark tool like JMeter to generate heavy load, and a profiler like VisualVM. Common problems:
Use good indices:
Columns that you are querying (SELECT, GROUP BY, ORDER BY, JOIN) could be faster with indices.
Indices are usually represented as self-balancing B-tree that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.
An index takes additional space, but can keep frequently read data in memory for fast lookups.
Writes could also be slower since the index also needs to be updated.
When loading large amounts of data, it might be faster to disable indices, load the data, then rebuild the indices.

Other Items:
MySQL dumps to disk in contiguous blocks for fast access.
Use CHAR instead of VARCHAR for fixed-length fields.
CHAR effectively allows for fast, random access, whereas with VARCHAR, you must find the end of a string before moving onto the next one.
Use TEXT for large blocks of text such as blog posts. TEXT also allows for boolean searches. Using a TEXT field results in storing a pointer on disk that is used to locate the text block.
Use INT for larger numbers up to 2^32 or 4 billion.
Use DECIMAL for currency to avoid floating point representation errors.
Avoid storing large BLOBS, store the location of where to get the object instead.
VARCHAR(255) is the largest number of characters that can be counted in an 8 bit number, often maximizing the use of a byte in some RDBMS.
Set the NOT NULL constraint where applicable to improve search performance.
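One concrete tuning step from the index points above: check the query plan before and after adding an index on a frequently filtered column. A sketch using sqlite3's `EXPLAIN QUERY PLAN` (other RDBMSs use `EXPLAIN` / `EXPLAIN ANALYZE`; table and index names are made up):

```python
# Verifying that a WHERE-clause column actually uses an index.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, ts TEXT)")
db.executemany("INSERT INTO events (user_id, ts) VALUES (?, ?)",
               [(i % 100, f"2024-01-{i % 28 + 1:02d}") for i in range(1000)])

query = "SELECT * FROM events WHERE user_id = 7"

plan_before = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_before[0][-1])  # e.g. "SCAN events" (full table scan)

db.execute("CREATE INDEX idx_events_user ON events(user_id)")
plan_after = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_after[0][-1])   # e.g. "SEARCH events USING INDEX idx_events_user"
```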

43
Q

AVL Trees, red/black trees, B trees

A

AVL trees: https://www.youtube.com/watch?v=Jj9Mit24CWk
They are self-balancing: each node tracks a balance factor, and the tree performs one of 4 rotation types (left, right, left-right, right-left) depending on which way the balance is off

red/black trees: https://www.youtube.com/watch?v=bqOSo1f1jbo
the root is always black, null leaves are black, every red node must have black children (no two red nodes in a row), and every inserted node starts out red. These rules keep the tree approximately balanced while requiring fewer rotations than an AVL tree. The AVL tree is more strictly balanced (so lookups can be slightly faster), but the red-black tree is often preferred because insertions and deletions need fewer rotations

B trees:
https://www.youtube.com/watch?v=SI6E4Ma2ddg
Even with the optimal trees, we have an issue with their height getting very large. We can change this by allowing multiple keys per node and multiple children per node!

Each node’s keys are stored in ascending order, and each child sits between two keys such that the child’s values fall between them.
Every node except the root needs at least M/2 (rounded up) children, where M is the order of the tree. This ensures we can shrink the height later by combining these “half” nodes on the same level into a “legal” node. An order-3 B tree (a 2-3 tree) is common.
A node with n children holds n-1 keys.
Finally, all leaf nodes must appear on the same level
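The per-node step of a B tree search follows directly from those rules: because keys are sorted, a binary search finds either the key itself or the child slot between the surrounding keys. A minimal sketch of that one step (not a full B tree):

```python
# One step of a B tree search within a single node.
import bisect

def search_node(keys, children, target):
    """keys: sorted list of n keys; children: list of n+1 subtrees,
    or None if this node is a leaf."""
    i = bisect.bisect_left(keys, target)
    if i < len(keys) and keys[i] == target:
        return ("found", i)        # the key lives in this node
    if children is None:
        return ("missing", None)   # hit a leaf without finding it
    return ("descend", i)          # recurse into children[i]

print(search_node([10, 20, 30], None, 20))  # ('found', 1)
print(search_node([10, 20, 30], None, 25))  # ('missing', None)
```

Because each node holds many keys, the tree stays shallow: the height is O(log_M n) rather than O(log_2 n), which matters when each node visit is a disk read.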

44
Q

NoSQL, what is it, benefits and cons

A

NoSQL is a class of databases that represent data as a graph, key-value, document, or wide column store. The data is denormalized, and ACID is traded for BASE:

Basically available: guarantees availability
Soft State: the data can change without user input
Eventual consistency: the system will become consistent over time

Choosing between SQL and NoSQL is an important decision: SQL is the safer, well-proven default, while NoSQL can offer scaling and flexibility pay-offs that SQL cannot

45
Q

Wide column store

A

Of the many NoSQL models (key-value, document, graph), wide column is among the most popular. Bigtable and Cassandra are wide column storage systems used at Google and Facebook.

46
Q

SQL or NoSql

A

Reasons for SQL:

Structured data
Strict schema
Relational data
Need for complex joins
Transactions
Clear patterns for scaling
More established: developers, community, code, tools, etc
Lookups by index are very fast

Reasons for NoSQL:

Semi-structured data
Dynamic or flexible schema
Non-relational data
No need for complex joins
Store many TB (or PB) of data
Very data intensive workload
Very high throughput for IOPS

Sample data well-suited for NoSQL:

Rapid ingest of clickstream and log data
Leaderboard or scoring data
Temporary data, such as a shopping cart
Frequently accessed ('hot') tables
Metadata/lookup tables
47
Q

Distributed joins, what are they and what types are there?

A

General overview
https://ignite.apache.org/docs/latest/SQL/distributed-joins

Detailed about each one (idk how accurate these are though?)
https://www.singlestore.com/blog/scaling-distributed-joins/

A distributed join is a SQL statement with a join clause that combines two or more partitioned tables. If the tables are joined on the partitioning column (affinity key), the join is called a colocated join. Otherwise, it is called a non-colocated join.

There are lots of different types of distributed joins. A few joins do not require moving data from 1 shard to another to perform a join, but others do.

The ones that do not require data movement usually need good schema planning ahead of time, such as reference tables and joins on shard keys. The ones that do require data movement also use reference tables and shard keys where possible to limit it, but in the worst case they must move large portions of the data between nodes.
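The reference-table case above can be sketched without any real database. This is a toy illustration (tables and shard layout are made up): a small lookup table is copied to every shard, so each shard joins locally and no rows move between shards.

```python
# Broadcast / reference-table join: the small countries table is
# replicated to every shard, so each shard joins independently.
countries = {"US": "United States", "FR": "France"}  # small reference table

shards = [  # user rows, partitioned by some shard key elsewhere
    [{"id": 1, "country": "US"}, {"id": 2, "country": "FR"}],
    [{"id": 3, "country": "US"}],
]

def local_join(shard_rows, ref):
    # Runs independently on each shard against its local copy of ref;
    # no data crosses shard boundaries.
    return [{**row, "country_name": ref[row["country"]]} for row in shard_rows]

results = [r for shard in shards for r in local_join(shard, countries)]
print([r["country_name"] for r in results])
```

A non-colocated join of two large sharded tables cannot do this: rows from one table must be shuffled to the shards holding their join partners, which is the expensive case.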

48
Q

Shard key, what is it?

A

https://www.bugsnag.com/blog/mongo-shard-key

A shard key is the key that determines how data is distributed across shards. You want to shard on a key that will evenly distribute data across your nodes. A geolocation would be a bad key because most people live in cities. A simple shard key is a hash of your item’s ID, with shards assigned based on the hash.

49
Q

Address collisions, in hash maps or networking?

A

idk