Scaling Data Flashcards
How can data archiving help manage data volume in Rails applications?
Archiving removes historical or rarely-accessed data from primary tables, transferring it to separate tables or storage. This keeps active tables small and query performance high while retaining important data for compliance or reporting.
What is a good strategy for indexing in large tables with high query volumes?
Index the most frequently queried columns, but avoid over-indexing, since every additional index slows writes. Composite indexes spanning multiple columns can serve complex queries efficiently but should be tested carefully to avoid performance penalties on inserts and updates.
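As a minimal sketch, adding an index without blocking writes on PostgreSQL might look like this; the users.email column is a hypothetical example, and algorithm: :concurrently requires opting out of the usual DDL transaction:

class AddEmailIndexToUsers < ActiveRecord::Migration[7.0]
  # CONCURRENTLY builds the index without locking writes (PostgreSQL only),
  # so the migration must run outside the wrapping transaction.
  disable_ddl_transaction!

  def change
    add_index :users, :email, algorithm: :concurrently
  end
end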
Why is read replica usage beneficial for data volume scalability?
Read replicas offload read operations from the main database, allowing high-volume applications to balance read-heavy loads across multiple database instances without compromising write performance on the primary server.
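A sketch using Rails 6+ multiple-database support; the primary_replica name is an assumption and must match an entry in database.yml, and Report is a hypothetical model:

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  # Map roles to database.yml entries; "primary_replica" is assumed here.
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Route a read-heavy query to the replica explicitly:
ActiveRecord::Base.connected_to(role: :reading) do
  Report.where(status: "complete").count
end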
How does Rails handle massive CSV or batch data imports without downtime?
Use batch processing libraries (e.g., activerecord-import) and perform imports in small, manageable chunks within background jobs. This reduces memory usage and prevents locking on large data tables, allowing the app to handle imports without performance degradation.
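A hedged sketch of a chunked import job, assuming the activerecord-import gem and a hypothetical User model with email and name columns:

require "csv"

class ImportUsersJob < ApplicationJob
  queue_as :default
  BATCH_SIZE = 1_000

  def perform(csv_path)
    buffer = []
    CSV.foreach(csv_path, headers: true) do |row|
      buffer << User.new(email: row["email"], name: row["name"])
      if buffer.size >= BATCH_SIZE
        # activerecord-import compiles each batch into one multi-row INSERT.
        User.import(buffer, validate: false)
        buffer.clear
      end
    end
    User.import(buffer, validate: false) if buffer.any?
  end
end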
What role do queuing systems play in handling high volumes of incoming data?
Queuing systems (e.g., Kafka, RabbitMQ) buffer and process high volumes of incoming data asynchronously, protecting the database from overload. This enables high-throughput data ingestion without directly impacting real-time app performance.
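For RabbitMQ, a minimal producer sketch with the bunny gem; the queue name and payload shape are assumptions:

require "bunny"
require "json"

conn = Bunny.new(ENV.fetch("RABBITMQ_URL", "amqp://localhost"))
conn.start
channel = conn.create_channel
# A durable queue survives broker restarts; the name is hypothetical.
queue = channel.queue("events.ingest", durable: true)

# Publish instead of writing straight to the database; a separate
# consumer drains the queue and persists records at its own pace.
payload = { user_id: 42, action: "page_view" }.to_json
channel.default_exchange.publish(payload, routing_key: queue.name, persistent: true)

conn.close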
How can Rails manage large-scale transactional data growth without impacting performance?
Segment transactional data by type (e.g., completed vs. pending) and move older, less-accessed data to separate tables or partitions. Additionally, ensure that reporting queries on transactional data are optimized with indexes and caching to avoid locking and slowdowns.
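One way to keep the hot path fast is a partial index covering only the active segment; a sketch assuming PostgreSQL and a hypothetical orders table with a status column:

class AddPendingOrdersIndex < ActiveRecord::Migration[7.0]
  def change
    # Partial index (PostgreSQL): only pending rows are indexed, so the
    # index stays small even as completed orders accumulate.
    add_index :orders, [:user_id, :created_at],
              where: "status = 'pending'",
              name: "index_orders_on_pending"
  end
end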
What techniques can be used to handle rapid data growth in user-generated content tables?
Use pagination with cursors rather than offsets for efficient navigation in large tables. Also, implement soft deletes to reduce the frequency of actual deletions, which can be costly in terms of database performance, while cleaning up data on a schedule to prevent bloat.
What are key considerations when setting up data archiving for large-scale applications?
Data archiving moves inactive data to separate storage or tables. Key steps involve defining clear archival criteria (e.g., data age, relevance) and creating automated jobs for regular archiving. Use historical tables or external storage for archived data, ensure permissions are consistent, and set up separate indexes for archived data to optimize retrieval if needed.
How would you implement a composite index to optimize a complex query, and what are the downsides?
For a query like:
SELECT * FROM users WHERE city = 'X' AND age > 25 ORDER BY created_at
...a composite index on (city, age, created_at) can improve performance. However, composite indexes increase write overhead, since every insert and update must maintain all indexed columns, which hurts write-heavy applications. Test for performance gains versus the added write latency.
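The equivalent migration, assuming a standard Rails users table:

class AddCityAgeCreatedAtIndexToUsers < ActiveRecord::Migration[7.0]
  def change
    # Lead with the equality-filtered column (city) so the index
    # narrows the scan before the range condition on age applies.
    add_index :users, [:city, :age, :created_at]
  end
end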
What impact does partitioning by geo-location have on data access and performance?
Partitioning by geo-location isolates data by region, reducing the amount of data to scan for location-based queries. This improves performance, especially in globally distributed systems with regional spikes in traffic. However, it can add complexity in merging cross-regional data and requires careful consideration of data localization laws if applicable.
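A sketch of list partitioning by region inside a migration, assuming PostgreSQL 11+ and a hypothetical events table; note that the partition key must be part of the primary key:

class CreatePartitionedEvents < ActiveRecord::Migration[7.0]
  def up
    # Declarative partitioning is created with raw SQL because the
    # Rails schema DSL does not cover PARTITION BY.
    execute <<~SQL
      CREATE TABLE events (
        id         bigserial,
        region     text        NOT NULL,
        payload    jsonb,
        created_at timestamptz NOT NULL,
        PRIMARY KEY (id, region)
      ) PARTITION BY LIST (region);

      CREATE TABLE events_eu PARTITION OF events FOR VALUES IN ('eu');
      CREATE TABLE events_us PARTITION OF events FOR VALUES IN ('us');
    SQL
  end

  def down
    execute "DROP TABLE events"
  end
end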
What exactly is read replica usage, and how does it differ from caching?
Read replicas are database instances that duplicate the primary database and handle read-only queries. Unlike a cache, which holds selected data temporarily in memory, a replica keeps a full, continuously synced copy of the database and can double as a recovery target. Replicas are ideal for offloading heavy read traffic, but replication lag makes them less suited to reads that must immediately reflect the latest writes.
What’s the relational database equivalent of sharding, and what are the pros and cons?
Neither MySQL nor PostgreSQL shards natively; instead, data is segmented into multiple databases by a shard key, either in application logic or with tooling such as Vitess (MySQL) or Citus (PostgreSQL). Pros include distributed load and storage scalability. However, relational sharding complicates cross-shard joins and transactions, often requiring custom application logic to keep data consistent across shards.
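Rails 6.1+ exposes horizontal sharding directly; a sketch assuming shard names that match database.yml entries:

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  # Shard names are assumptions; each must exist in database.yml.
  connects_to shards: {
    default:   { writing: :primary },
    shard_one: { writing: :primary_shard_one }
  }
end

# Route work to the shard selected by the app's shard-key logic:
ActiveRecord::Base.connected_to(shard: :shard_one, role: :writing) do
  User.find(42)  # an id known to live on shard_one
end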
What exactly are soft deletes in database management, and how are they used?
Soft deletes mark records as “deleted” using a flag (e.g., deleted_at timestamp) rather than removing them. This approach retains data for auditing and recovery, keeping records accessible if needed. It avoids database fragmentation caused by frequent deletions but can clutter tables, requiring periodic cleanup to avoid bloat.
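A minimal hand-rolled sketch (gems such as discard or paranoia package the same idea), assuming a deleted_at timestamp column on a hypothetical Post model:

class Post < ApplicationRecord
  # Hide soft-deleted rows from normal queries.
  scope :kept, -> { where(deleted_at: nil) }

  def soft_delete
    update(deleted_at: Time.current)
  end
end

Post.kept.count           # active records only
Post.find(1).soft_delete  # sets the flag instead of issuing a DELETE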
What’s the difference between cursor-based pagination and offset-based pagination?
Offset pagination retrieves a page of data by skipping rows, while cursor pagination tracks the last item’s position and retrieves the next batch from there. Cursor pagination is more efficient for large data sets since it doesn’t require scanning all skipped rows, unlike offset pagination, which becomes slow with high offsets.
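The two approaches side by side, assuming an auto-incrementing id serves as the cursor:

# Offset pagination: the database still walks past every skipped row.
Post.order(:id).offset(100_000).limit(20)

# Cursor pagination: seek directly from the last id the client saw.
last_seen_id = 100_000  # supplied by the client from the previous page
Post.where("id > ?", last_seen_id).order(:id).limit(20)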
How would you set up automated data archiving to ensure high performance?
Schedule background jobs (e.g., weekly) to move old records to archive tables or external storage. Archive in batches to avoid table locking and test data retrieval to ensure archived data remains accessible. Track archived data volumes to determine the right frequency and thresholds for archiving without overloading storage.
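A hedged sketch of such a job, assuming Rails 6+ (for insert_all) and hypothetical Order/ArchivedOrder tables with matching columns; wired to a scheduler such as cron or sidekiq-cron, it runs unattended:

class ArchiveOldOrdersJob < ApplicationJob
  queue_as :low_priority

  def perform
    Order.where("created_at < ?", 2.years.ago).in_batches(of: 1_000) do |batch|
      rows = batch.map(&:attributes)
      Order.transaction do
        # Copy then delete inside one transaction so a failure never
        # leaves rows half-archived; small batches keep locks short.
        ArchivedOrder.insert_all(rows)
        batch.delete_all
      end
    end
  end
end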