Solutions Architect Problem Scenarios Flashcards

1
Q

I work for Genentech, doing bioinformatics work.

We’re sequencing the DNA of 10,000 patients with Parkinson’s disease and then analyzing those genomes to try to find genetic risk factors for developing Parkinson’s. Uncompressed, each genome is about 770 MB.
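As a rough sizing check, the figures above work out to about 7.7 TB of raw data in total. A quick sketch of that arithmetic, assuming decimal units (1 TB = 1,000,000 MB) and no compression:

```python
# Back-of-the-envelope sizing from the numbers in the question.
PATIENTS = 10_000
GENOME_MB = 770  # uncompressed size per genome, in MB

MB_PER_TB = 1_000_000  # assumption: decimal units, not binary (TiB)

total_mb = PATIENTS * GENOME_MB
total_tb = total_mb / MB_PER_TB

print(f"Total raw data: {total_mb:,} MB (~{total_tb} TB)")
```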

Here’s what I need:
- When we get the data, it’s loaded into our on-prem filesystem. I need that data to be automatically replicated to AWS.

(AWS DataSync or AWS Storage Gateway to upload to S3; maybe multipart upload?)

  • Each time a new genome is uploaded to AWS, it should automatically:
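    ** Create a low-cost backup of the genome that can’t be deleted. It’s really unlikely we’d need this, but we need to do it for compliance reasons. If for some reason we did need that data, we wouldn’t need it immediately.
    (think S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive, made undeletable with S3 Object Lock in compliance mode, or with Glacier Vault Lock if the backup goes into a Glacier vault)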
    ** Prepare it for HPC / ML by loading it into highly performant storage (think FSx for Lustre)
    ** Kick off jobs for HPC / Machine Learning services to do analysis.

(S3 event notification -> SNS fan-out pattern to SQS queues, one per task; for HPC / ML, might want SageMaker or Amazon EMR)
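To make the fan-out note concrete: each SQS consumer (backup, FSx load, job kickoff) receives an SQS message whose body is an SNS envelope, and the envelope's "Message" field holds the S3 event as a JSON string. A minimal sketch of unpacking that, assuming raw message delivery is off on the subscription; the bucket and key names in the sample are made up:

```python
import json
from urllib.parse import unquote_plus


def objects_from_sqs_body(body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for newly created objects.

    `body` is the SQS message body: an SNS envelope wrapping an S3 event.
    """
    envelope = json.loads(body)              # outer SNS envelope
    event = json.loads(envelope["Message"])  # inner S3 event, stored as a string
    found = []
    # S3 test events carry no "Records" key, so default to an empty list.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        found.append((bucket, key))
    return found


# Hypothetical sample message, shaped like an S3 -> SNS -> SQS delivery.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-genomes"},
                "object": {"key": "genomes/patient+0001.fastq"},
            },
        }
    ]
}
sample_body = json.dumps({"Type": "Notification", "Message": json.dumps(sample_event)})

print(objects_from_sqs_body(sample_body))
```

Giving each task its own queue means a slow or failing consumer (say, the FSx load) doesn't block the backup or the job kickoff.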

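  • If any given genome hasn’t been read for 3 months, we’re done analyzing it and it can be deleted.
    (S3 Lifecycle expiration rules; note these count from object creation, not last read, so a strict "not read for 3 months" rule needs access tracking as well)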
- Because this is clinical data, it needs to be super secure. It's a potential class-action lawsuit if the data leaks.
(S3 encryption; limit access to only the IAM roles that need it)
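For the three-month cleanup requirement in the list above, a lifecycle rule might look like the sketch below. The dict follows the shape boto3's `put_bucket_lifecycle_configuration` takes for its `LifecycleConfiguration` argument. The rule ID and the `genomes/` prefix are made up, and 90 days is only an approximation of "3 months". The expiration clock starts at object creation, so this is not a true "last read" rule.

```python
import json

# Assumptions: working copies live under a hypothetical "genomes/" prefix,
# and 90 days stands in for "3 months".
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-analyzed-genomes",
            "Status": "Enabled",
            "Filter": {"Prefix": "genomes/"},
            # Days are counted from object creation, not from the last read.
            "Expiration": {"Days": 90},
        }
    ]
}

# With boto3 this would be applied roughly as:
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="example-genomes",
#       LifecycleConfiguration=lifecycle_configuration,
#   )
print(json.dumps(lifecycle_configuration, indent=2))
```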

See: https://aws.amazon.com/health/genomics/

How would I design this system?

A

(this is all junk)

Before doing anything else, you need to understand what the legal requirements are for handling clinical trial data. At a bare minimum, restrict access to that data so that only the services that need it can decrypt it or connect to the database it’s stored in (think IAM roles).
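One way the "only the roles that need it" idea could be expressed for S3 is a bucket policy with explicit denies. This is a sketch, not a vetted compliance control: the bucket name and role ARN are placeholders, and it assumes a single analysis role is the only principal that should read objects.

```python
import json


def restrictive_bucket_policy(bucket: str, role_arn: str) -> dict:
    """Build a bucket policy that denies non-TLS access and denies object
    reads to every principal except `role_arn`."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    objects_arn = f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that is not made over HTTPS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, objects_arn],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject object reads from anyone but the analysis role.
                "Sid": "DenyReadsExceptAnalysisRole",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": objects_arn,
                "Condition": {"ArnNotEquals": {"aws:PrincipalArn": role_arn}},
            },
        ],
    }


# Placeholder names for illustration only.
policy = restrictive_bucket_policy(
    "example-genomes",
    "arn:aws:iam::123456789012:role/genome-analysis",
)
print(json.dumps(policy, indent=2))
```

An explicit deny overrides any allow elsewhere, so the second statement also locks out administrators from reading objects; that may or may not be what the compliance rules call for.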

You probably can’t use a fully-managed service here because you’re running proprietary software.

Use compute-optimized instances (which suit compute-bound HPC workloads). To keep network latency between them as low as possible, launch them in a cluster placement group.
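A sketch of what the placement-group setup could look like as request parameters for boto3's EC2 client (`create_placement_group` and `run_instances`). The group name, instance type, AMI ID, and node count are all illustrative assumptions, and the helper function itself is hypothetical:

```python
def cluster_launch_requests(group_name: str, instance_type: str,
                            count: int, image_id: str) -> tuple[dict, dict]:
    """Build keyword arguments for creating a cluster placement group and
    launching `count` instances into it."""
    placement_group = {"GroupName": group_name, "Strategy": "cluster"}
    instances = {
        "ImageId": image_id,
        "InstanceType": instance_type,
        # MinCount == MaxCount: launch all nodes in one request or none,
        # which helps them land close together.
        "MinCount": count,
        "MaxCount": count,
        "Placement": {"GroupName": group_name},
    }
    return placement_group, instances


# Illustrative values only; the AMI ID is a placeholder.
pg_args, run_args = cluster_launch_requests(
    "genome-hpc", "c5n.18xlarge", 4, "ami-0123456789abcdef0"
)

# With boto3 these would be used roughly as:
#   ec2 = boto3.client("ec2")
#   ec2.create_placement_group(**pg_args)
#   ec2.run_instances(**run_args)
print(pg_args)
print(run_args)
```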
