Scaling Data Flashcards
How can data archiving help manage data volume in Rails applications?
Archiving removes historical or rarely-accessed data from primary tables, transferring it to separate tables or storage. This keeps active tables small and query performance high while retaining important data for compliance or reporting.
What is a good strategy for indexing in large tables with high query volumes?
Index the most frequently queried columns, but avoid over-indexing, since every additional index slows writes. Composite indexes spanning multiple columns can serve complex queries efficiently but should be tested carefully to avoid performance penalties on inserts and updates.
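As a minimal sketch, adding an index without blocking writes on PostgreSQL might look like this; the users.email column is a hypothetical example, and algorithm: :concurrently requires opting out of the usual DDL transaction:

class AddEmailIndexToUsers < ActiveRecord::Migration[7.0]
  # CONCURRENTLY builds the index without locking writes (PostgreSQL only),
  # so the migration must run outside the wrapping transaction.
  disable_ddl_transaction!

  def change
    add_index :users, :email, algorithm: :concurrently
  end
end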
Why is read replica usage beneficial for data volume scalability?
Read replicas offload read operations from the main database, allowing high-volume applications to balance read-heavy loads across multiple database instances without compromising write performance on the primary server.
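A sketch using Rails 6+ multiple-database support; the primary_replica name is an assumption and must match an entry in database.yml, and Report is a hypothetical model:

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  # Map roles to database.yml entries; "primary_replica" is assumed here.
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Route a read-heavy query to the replica explicitly:
ActiveRecord::Base.connected_to(role: :reading) do
  Report.where(status: "complete").count
end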
How does Rails handle massive CSV or batch data imports without downtime?
Use batch processing libraries (e.g., activerecord-import) and perform imports in small, manageable chunks within background jobs. This reduces memory usage and prevents locking on large data tables, allowing the app to handle imports without performance degradation.
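A hedged sketch of a chunked import job, assuming the activerecord-import gem and a hypothetical User model with email and name columns:

require "csv"

class ImportUsersJob < ApplicationJob
  queue_as :default
  BATCH_SIZE = 1_000

  def perform(csv_path)
    buffer = []
    CSV.foreach(csv_path, headers: true) do |row|
      buffer << User.new(email: row["email"], name: row["name"])
      if buffer.size >= BATCH_SIZE
        # activerecord-import compiles each batch into one multi-row INSERT.
        User.import(buffer, validate: false)
        buffer.clear
      end
    end
    User.import(buffer, validate: false) if buffer.any?
  end
end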
What role do queuing systems play in handling high volumes of incoming data?
Queuing systems (e.g., Kafka, RabbitMQ) buffer and process high volumes of incoming data asynchronously, protecting the database from overload. This enables high-throughput data ingestion without directly impacting real-time app performance.
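For RabbitMQ, a minimal producer sketch with the bunny gem; the queue name and payload shape are assumptions:

require "bunny"
require "json"

conn = Bunny.new(ENV.fetch("RABBITMQ_URL", "amqp://localhost"))
conn.start
channel = conn.create_channel
# A durable queue survives broker restarts; the name is hypothetical.
queue = channel.queue("events.ingest", durable: true)

# Publish instead of writing straight to the database; a separate
# consumer drains the queue and persists records at its own pace.
payload = { user_id: 42, action: "page_view" }.to_json
channel.default_exchange.publish(payload, routing_key: queue.name, persistent: true)

conn.close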
How can Rails manage large-scale transactional data growth without impacting performance?
Segment transactional data by type (e.g., completed vs. pending) and move older, less-accessed data to separate tables or partitions. Additionally, ensure that reporting queries on transactional data are optimized with indexes and caching to avoid locking and slowdowns.
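One way to keep the hot path fast is a partial index covering only the active segment; a sketch assuming PostgreSQL and a hypothetical orders table with a status column:

class AddPendingOrdersIndex < ActiveRecord::Migration[7.0]
  def change
    # Partial index (PostgreSQL): only pending rows are indexed, so the
    # index stays small even as completed orders accumulate.
    add_index :orders, [:user_id, :created_at],
              where: "status = 'pending'",
              name: "index_orders_on_pending"
  end
end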
What techniques can be used to handle rapid data growth in user-generated content tables?
Use pagination with cursors rather than offsets for efficient navigation in large tables. Also, implement soft deletes to reduce the frequency of actual deletions, which can be costly in terms of database performance, while cleaning up data on a schedule to prevent bloat.
What are key considerations when setting up data archiving for large-scale applications?
Data archiving moves inactive data to separate storage or tables. Key steps involve defining clear archival criteria (e.g., data age, relevance) and creating automated jobs for regular archiving. Use historical tables or external storage for archived data, ensure permissions are consistent, and set up separate indexes for archived data to optimize retrieval if needed.
How would you implement a composite index to optimize a complex query, and what are the downsides?
For a query like:
SELECT * FROM users WHERE city = 'X' AND age > 25 ORDER BY created_at
...a composite index on (city, age, created_at) can improve performance. However, composite indexes increase write overhead, since every insert and update must maintain all indexed columns, which hurts write-heavy applications. Test for performance gains versus the added write latency.
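The equivalent migration, assuming a standard Rails users table:

class AddCityAgeCreatedAtIndexToUsers < ActiveRecord::Migration[7.0]
  def change
    # Lead with the equality-filtered column (city) so the index
    # narrows the scan before the range condition on age applies.
    add_index :users, [:city, :age, :created_at]
  end
end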
What impact does partitioning by geo-location have on data access and performance?
Partitioning by geo-location isolates data by region, reducing the amount of data to scan for location-based queries. This improves performance, especially in globally distributed systems with regional spikes in traffic. However, it can add complexity in merging cross-regional data and requires careful consideration of data localization laws if applicable.
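A sketch of list partitioning by region inside a migration, assuming PostgreSQL 11+ and a hypothetical events table; note that the partition key must be part of the primary key:

class CreatePartitionedEvents < ActiveRecord::Migration[7.0]
  def up
    # Declarative partitioning is created with raw SQL because the
    # Rails schema DSL does not cover PARTITION BY.
    execute <<~SQL
      CREATE TABLE events (
        id         bigserial,
        region     text        NOT NULL,
        payload    jsonb,
        created_at timestamptz NOT NULL,
        PRIMARY KEY (id, region)
      ) PARTITION BY LIST (region);

      CREATE TABLE events_eu PARTITION OF events FOR VALUES IN ('eu');
      CREATE TABLE events_us PARTITION OF events FOR VALUES IN ('us');
    SQL
  end

  def down
    execute "DROP TABLE events"
  end
end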
What exactly is read replica usage, and how does it differ from caching?
Read replicas are database instances that duplicate the primary database and handle read-only queries. Unlike a cache, which holds selected data temporarily in memory, a replica keeps a full, continuously synced copy of the database and can double as a recovery target. Replicas are ideal for offloading heavy read traffic, but replication lag makes them less suited to reads that must immediately reflect the latest writes.
What’s the relational database equivalent of sharding, and what are the pros and cons?
Neither MySQL nor PostgreSQL shards natively; instead, data is segmented into multiple databases by a shard key, either in application logic or with tooling such as Vitess (MySQL) or Citus (PostgreSQL). Pros include distributed load and storage scalability. However, relational sharding complicates cross-shard joins and transactions, often requiring custom application logic to keep data consistent across shards.
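Rails 6.1+ exposes horizontal sharding directly; a sketch assuming shard names that match database.yml entries:

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  # Shard names are assumptions; each must exist in database.yml.
  connects_to shards: {
    default:   { writing: :primary },
    shard_one: { writing: :primary_shard_one }
  }
end

# Route work to the shard selected by the app's shard-key logic:
ActiveRecord::Base.connected_to(shard: :shard_one, role: :writing) do
  User.find(42)  # an id known to live on shard_one
end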
What exactly are soft deletes in database management, and how are they used?
Soft deletes mark records as “deleted” using a flag (e.g., deleted_at timestamp) rather than removing them. This approach retains data for auditing and recovery, keeping records accessible if needed. It avoids database fragmentation caused by frequent deletions but can clutter tables, requiring periodic cleanup to avoid bloat.
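A minimal hand-rolled sketch (gems such as discard or paranoia package the same idea), assuming a deleted_at timestamp column on a hypothetical Post model:

class Post < ApplicationRecord
  # Hide soft-deleted rows from normal queries.
  scope :kept, -> { where(deleted_at: nil) }

  def soft_delete
    update(deleted_at: Time.current)
  end
end

Post.kept.count           # active records only
Post.find(1).soft_delete  # sets the flag instead of issuing a DELETE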
What’s the difference between cursor-based pagination and offset-based pagination?
Offset pagination retrieves a page of data by skipping rows, while cursor pagination tracks the last item’s position and retrieves the next batch from there. Cursor pagination is more efficient for large data sets since it doesn’t require scanning all skipped rows, unlike offset pagination, which becomes slow with high offsets.
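The two approaches side by side, assuming an auto-incrementing id serves as the cursor:

# Offset pagination: the database still walks past every skipped row.
Post.order(:id).offset(100_000).limit(20)

# Cursor pagination: seek directly from the last id the client saw.
last_seen_id = 100_000  # supplied by the client from the previous page
Post.where("id > ?", last_seen_id).order(:id).limit(20)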
How would you set up automated data archiving to ensure high performance?
Schedule background jobs (e.g., weekly) to move old records to archive tables or external storage. Archive in batches to avoid table locking and test data retrieval to ensure archived data remains accessible. Track archived data volumes to determine the right frequency and thresholds for archiving without overloading storage.
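A hedged sketch of such a job, assuming Rails 6+ (for insert_all) and hypothetical Order/ArchivedOrder tables with matching columns; wired to a scheduler such as cron or sidekiq-cron, it runs unattended:

class ArchiveOldOrdersJob < ApplicationJob
  queue_as :low_priority

  def perform
    Order.where("created_at < ?", 2.years.ago).in_batches(of: 1_000) do |batch|
      rows = batch.map(&:attributes)
      Order.transaction do
        # Copy then delete inside one transaction so a failure never
        # leaves rows half-archived; small batches keep locks short.
        ArchivedOrder.insert_all(rows)
        batch.delete_all
      end
    end
  end
end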