Database Specialty - Neptune Flashcards
1
Q
Amazon Neptune – Overview
A
- Fully managed graph database service (non-relational)
- Relationships are first-class citizens
- Can quickly navigate relationships and retrieve complex relations between highly connected datasets
- Can query billions of relationships with millisecond latency
- ACID compliant with immediate consistency
- Supports transaction semantics for highly concurrent OLTP workloads (ACID transactions)
- Supported graph query languages – Apache TinkerPop Gremlin and RDF/SPARQL
- Supports up to 15 low-latency read replicas (Multi-AZ)
- Use cases:
- Social graph / Knowledge graph
- Fraud detection
- Real-time big data mining
- Customer interests and recommendations (Recommendation engines)
2
Q
Graph Database
A
- Models relationships between data
- e.g. Subject / predicate / object / graph (quad)
- Joe likes pizza
- Sarah is friends with Joe
- Sarah likes pizza too
- Joe is a student and lives in London
- Lets you ask questions like “identify Londoners who like pizza” or “identify friends of Londoners who like pizza”
- Uses nodes (vertices) and edges (actions) to describe the data and relationships between them
- DB stores – person / action / object (and a graph ID or edge ID)
- Can filter or discover data based on strength, weight, or quality of relationships
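The example facts on this card can be sketched as a tiny in-memory triple store. This is only an illustration of the subject / predicate / object model, not how Neptune stores data; the predicate names (`likes`, `friend_of`, `lives_in`, `is_a`) and helper functions are made up for the sketch, and "friends of Londoners who like pizza" is read here as "pizza fans who are friends with a Londoner".

```python
# Facts from the card, stored as (subject, predicate, object) triples.
# Predicate names are illustrative assumptions, not a Neptune schema.
TRIPLES = {
    ("Joe", "likes", "pizza"),
    ("Sarah", "friend_of", "Joe"),
    ("Sarah", "likes", "pizza"),
    ("Joe", "is_a", "student"),
    ("Joe", "lives_in", "London"),
}


def subjects(predicate, obj, triples=TRIPLES):
    """Return every subject that has an edge (predicate) pointing at obj."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}


def londoners_who_like_pizza(triples=TRIPLES):
    """Intersect two edge lookups: lives_in London and likes pizza."""
    return subjects("lives_in", "London", triples) & subjects("likes", "pizza", triples)


def friends_of_londoners_who_like_pizza(triples=TRIPLES):
    """People with a friend_of edge to a Londoner who also like pizza themselves."""
    londoners = subjects("lives_in", "London", triples)
    friends = {s for (s, p, o) in triples if p == "friend_of" and o in londoners}
    return friends & subjects("likes", "pizza", triples)


if __name__ == "__main__":
    print(londoners_who_like_pizza())
    print(friends_of_londoners_who_like_pizza())
```

Both questions reduce to following edges and intersecting the results, which is the kind of traversal a graph database is built to do efficiently.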
3
Q
Graph query languages
A
- Neptune supports two popular modeling frameworks – Apache TinkerPop and RDF/SPARQL
- TinkerPop uses Gremlin traversal language
- RDF (W3C standard) uses SPARQL
- SPARQL is great for multiple data sources, has large variety of datasets available
- We can use Gremlin or SPARQL to load data into Neptune and then to query it
- You can store both Gremlin and SPARQL graph data on the same Neptune cluster
- It gets stored separately on the cluster
- Graph data inserted using one query language can only be queried
with that query language (and not with the other)
4
Q
Neptune Architecture
A
- 6 copies of your data across 3 AZ (distributed design)
- Lock-free optimistic algorithm (quorum model)
- 4 copies out of 6 needed for writes (4/6 write quorum - data considered durable when at least 4/6 copies acknowledge the write)
- 3 copies out of 6 needed for reads (3/6 read quorum)
- Self healing with peer-to-peer replication, Storage is striped across 100s of volumes
- One Neptune Instance takes writes (master)
- Compute nodes on replicas do not need to write/replicate (=improved read performance)
- Log-structured distributed storage layer – passes incremental log records from compute to storage layer (=faster)
- Master + up to 15 Read Replicas serve reads
- Data is continuously backed up to S3 in real time, using storage nodes (compute node performance is unaffected)
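The quorum numbers above can be checked with a little arithmetic. The sketch below is not Neptune code; the function names are hypothetical, and the even split of 2 copies per AZ is an assumption inferred from "6 copies across 3 AZ".

```python
COPIES = 6          # total copies of the data
AZS = 3             # Availability Zones the copies are spread across
WRITE_QUORUM = 4    # acknowledgements needed before a write is durable
READ_QUORUM = 3     # copies needed to serve a read

# Assumption for this sketch: copies are spread evenly, i.e. 2 per AZ.
COPIES_PER_AZ = COPIES // AZS


def write_is_durable(acks):
    """A write counts as durable once at least 4 of 6 copies acknowledge it."""
    return acks >= WRITE_QUORUM


def read_is_possible(available_copies):
    """A read needs at least 3 of 6 copies to be reachable."""
    return available_copies >= READ_QUORUM


def quorums_overlap(n=COPIES, w=WRITE_QUORUM, r=READ_QUORUM):
    """If w + r > n, every read set shares at least one copy with every write set."""
    return w + r > n


def copies_left_after_az_loss(lost_azs):
    """Copies still reachable after losing whole AZs (under the 2-per-AZ assumption)."""
    return COPIES - lost_azs * COPIES_PER_AZ
```

Under these assumptions, losing one AZ leaves 4 copies (writes and reads still possible), and losing an AZ plus one more copy leaves 3 (reads still possible, writes not).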
5
Q
Neptune Cluster
A
- Loader endpoint – to load the data into Neptune (say, from S3)
- e.g. https://<cluster_endpoint>:8182/loader
- Gremlin endpoint – for Gremlin queries
- e.g. https://<cluster_endpoint>:8182/gremlin
- SPARQL endpoint – for SPARQL queries
- e.g. https://<cluster_endpoint>:8182/sparql
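As a rough illustration of how the query endpoints are addressed, the sketch below builds (but does not send) HTTP POST requests for the Gremlin and SPARQL endpoints using only the standard library. The helper names and the host `example-host` are placeholders; a real cluster would also need network access from inside the VPC and, if IAM authentication is enabled, signed requests.

```python
import json
import urllib.parse
import urllib.request

PORT = 8182  # port used in the endpoint examples on this card


def endpoint_url(cluster_endpoint, path):
    """Build https://<cluster_endpoint>:8182/<path> for loader, gremlin, or sparql."""
    return f"https://{cluster_endpoint}:{PORT}/{path}"


def gremlin_request(cluster_endpoint, traversal):
    """Prepare a POST carrying a Gremlin traversal as a JSON body."""
    body = json.dumps({"gremlin": traversal}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url(cluster_endpoint, "gremlin"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def sparql_request(cluster_endpoint, query):
    """Prepare a POST carrying a SPARQL query as a form-encoded 'query' field."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url(cluster_endpoint, "sparql"),
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    req = gremlin_request("example-host", "g.V().limit(1)")
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would send it; omitted because it needs a live cluster.
```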
6
Q
Bulk loading data into Neptune
A
- Use the loader endpoint (HTTP POST to the loader endpoint)
- e.g. curl -X POST -H 'Content-Type: application/json' \
    https://<cluster_endpoint>:8182/loader -d '
    {
      "source": "s3://bucket_name/key_name",
      …
    }'
- S3 data can be accessed using an S3 VPC endpoint (allows access to S3 resources from your VPC)
- Neptune cluster must assume an IAM role with S3 read access
- S3 VPC endpoint can be created using the VPC management console
- S3 bucket must be in the same region as the Neptune cluster
- Load data formats
- csv (for Gremlin), ntriples / nquads / rdfxml / turtle (for SPARQL)
- All files must be UTF-8 encoded
- Multiple files can be loaded in a single job
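The loader call can also be sketched in Python. The function below only builds the POST request and does not send it. The helper name and all argument values are placeholders, and the payload fields beyond `source` (`format`, `iamRoleArn`, `region`, `failOnError`) are included as the commonly used loader parameters; check the loader reference for the full set before relying on this.

```python
import json
import urllib.request

# Formats listed on this card: csv for Gremlin data, the rest for RDF/SPARQL data.
LOAD_FORMATS = {"csv", "ntriples", "nquads", "rdfxml", "turtle"}


def build_loader_request(cluster_endpoint, source, data_format, iam_role_arn, region):
    """Prepare (but do not send) an HTTP POST to the Neptune loader endpoint."""
    if data_format not in LOAD_FORMATS:
        raise ValueError(f"unsupported load format: {data_format!r}")
    if not source.startswith("s3://"):
        raise ValueError("source must be an s3:// URI")
    payload = {
        "source": source,            # S3 object or prefix; must be in the cluster's region
        "format": data_format,
        "iamRoleArn": iam_role_arn,  # role the cluster assumes to read from S3
        "region": region,
        "failOnError": "FALSE",      # keep loading the remaining data if one record fails
    }
    return urllib.request.Request(
        f"https://{cluster_endpoint}:8182/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_loader_request(
        "example-host",                                  # placeholder endpoint
        "s3://bucket_name/key_name",
        "csv",
        "arn:aws:iam::123456789012:role/ExampleRole",    # placeholder role ARN
        "us-east-1",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Pointing `source` at an S3 prefix rather than a single object is how several files end up in one load job.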
7
Q
Neptune Workbench
A
- Lets you query your Neptune cluster using notebooks
- Notebooks are Jupyter notebooks hosted by Amazon SageMaker
- Available within AWS console
- Notebook runs behind the scenes on an EC2 host in the same VPC and has IAM authentication
- The security group that you attach in the VPC where Neptune is running must have an additional rule that allows inbound connections from itself
8
Q
Neptune Replication
A
- Up to 15 read replicas
- ASYNC replication
- Replicas share the same underlying storage layer
- Replication lag is typically 10s of milliseconds
- Minimal performance impact on the primary due to replication process
- Replicas double up as failover targets (standby instance is not needed)
9
Q
Neptune High Availability
A
- Failovers occur automatically
- A replica is automatically promoted to be the new primary during DR
- Neptune flips the CNAME of the DB instance to point to the replica and promotes it
- Failover to a replica typically takes 30-120 seconds (minimal downtime)
- Creating a new instance takes about 15 minutes (post failover)
- Failover to a new instance happens on a best-effort basis and can take longer
10
Q
Neptune Backup and Restore
A
- Supports automatic backups
- Continuously backs up your data to S3 for PITR (max retention period of 35 days)
- Latest restorable time for a PITR can be up to 5 mins in the past (RPO = 5 minutes)
- The first backup is a full backup. Subsequent backups are incremental
- Take manual snapshots to retain beyond 35 days
- Backup process does not impact cluster performance
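The PITR limits above define a restorable window. The sketch below computes it under simplifying assumptions: the latest restorable time is treated as exactly 5 minutes behind "now" (the card says "up to"), and a minimum retention of 1 day is assumed. The function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

MAX_RETENTION_DAYS = 35             # from the card
MIN_RETENTION_DAYS = 1              # assumption for this sketch
WORST_CASE_LAG = timedelta(minutes=5)  # RPO = 5 minutes


def restorable_window(now, retention_days):
    """Return (earliest, latest) restorable times for point-in-time restore."""
    if not MIN_RETENTION_DAYS <= retention_days <= MAX_RETENTION_DAYS:
        raise ValueError("retention_days must be between 1 and 35")
    return now - timedelta(days=retention_days), now - WORST_CASE_LAG


def can_restore_to(target, now, retention_days):
    """True if target falls inside the restorable window."""
    earliest, latest = restorable_window(now, retention_days)
    return earliest <= target <= latest


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(restorable_window(now, 7))
    print(can_restore_to(now - timedelta(hours=3), now, 7))
```

Anything older than the retention period falls outside the window, which is why manual snapshots are needed for longer retention.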
11
Q
Neptune Backup and Restore
A
- Can only restore to a new cluster
- Can restore an unencrypted snapshot to an encrypted cluster (but not the other way round)
- To restore a cluster from an encrypted snapshot, you must have access to the KMS key
- Can only share manual snapshots (automated ones must be copied first; the copy can then be shared)
- Can’t share a snapshot encrypted using the default KMS key of the account
- Snapshots can be shared across accounts, but within the same region
12
Q
Neptune Scaling
A
- Vertical scaling (scale up / down) – by resizing instances
- Horizontal scaling (scale out / in) – by adding / removing up to 15 read replicas
- Automatic scaling storage – 10 GB to 64 TB (no manual intervention needed)
13
Q
Database Cloning in Neptune
A
- Different from creating read replicas – clones support both reads and writes
- Different from replicating a cluster – clones use same storage layer as the source cluster
- Requires only minimal additional storage
- Quick and cost-effective
- Only within region (can be in different VPC)
- Can be created from existing clones
- Uses a copy-on-write protocol
- both source and clone share the same data initially
- data that changes is copied at the time it changes, either on the source or on the clone (i.e. stored separately from the shared data)
- delta of writes after cloning is not shared
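The copy-on-write behaviour can be illustrated with a toy model. This is a conceptual sketch only, not how Neptune's storage layer is implemented; the `Cluster` class, its methods, and the "page" granularity are all invented for the example.

```python
from types import MappingProxyType


class Cluster:
    """Toy cluster: a read-only shared base plus a private delta of changed pages."""

    def __init__(self, shared):
        self._shared = shared   # pages shared with other clusters; never mutated here
        self._delta = {}        # pages this cluster has written since the last clone

    def read(self, page_id):
        # Prefer this cluster's own copy of the page, else fall back to shared data.
        if page_id in self._delta:
            return self._delta[page_id]
        return self._shared[page_id]

    def write(self, page_id, data):
        # Copy-on-write: the change is stored separately from the shared pages.
        self._delta[page_id] = data

    def clone(self):
        # Freeze the current view as a read-only base that both clusters point to.
        # (This toy merges dictionaries; a real system would share pages in place.)
        base = MappingProxyType({**self._shared, **self._delta})
        self._shared, self._delta = base, {}
        return Cluster(base)

    def extra_pages(self):
        """Number of pages stored privately, i.e. the additional storage used."""
        return len(self._delta)


if __name__ == "__main__":
    source = Cluster({"p1": "a", "p2": "b"})
    clone = source.clone()
    clone.write("p1", "A")                       # only the clone sees this change
    print(source.read("p1"), clone.read("p1"))
    print(clone.extra_pages())
```

Right after cloning, neither side holds private pages, which mirrors the "minimal additional storage" point; extra storage grows only with the delta of writes on each side.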
14
Q
Neptune Security – IAM
A
- Uses IAM for authentication and authorization to manage Neptune resources
- Supports IAM Authentication (with AWS SigV4)
- You use temporary credentials using an assumed role
- Create an IAM role
- Set up a trust relationship
- Retrieve temp creds
- Sign the requests using the creds
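The final "sign the requests" step relies on the SigV4 signing-key derivation, sketched below with the standard library. This covers only the key derivation and the final HMAC; building the canonical request and string-to-sign is omitted, and in practice an AWS SDK or signing library would normally do all of this. The function names and the sample secret are placeholders, and `neptune-db` is used as the service name for data-plane requests.

```python
import hashlib
import hmac


def _hmac_sha256(key, msg):
    """HMAC-SHA256 of a text message with a bytes key, returned as raw bytes."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key, date_stamp, region, service="neptune-db"):
    """Derive the SigV4 signing key by chaining HMACs over date, region, and service.

    date_stamp is in YYYYMMDD form, e.g. "20240131".
    """
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(signing_key, string_to_sign):
    """Return the hex signature for an already-built SigV4 string-to-sign."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # Placeholder secret; real values come from the temporary credentials of the assumed role.
    key = derive_signing_key("EXAMPLE-SECRET-KEY", "20240131", "us-east-1")
    print(len(key))
```

With temporary credentials, the session token is also sent with the request (as the `X-Amz-Security-Token` header) alongside the signature.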
15
Q
Neptune Security – Encryption & Network
A
- Encryption in transit – using SSL / TLS
- Cluster parameter neptune_enforce_ssl = 1 (the default)
- Encryption at rest – with AES-256 using KMS
- encrypts data, automated backups, snapshots, and replicas in the same cluster
- Neptune clusters are VPC-only (use private subnets)
- Clients can run on EC2 in public subnets within VPC
- Can connect to your on-premises IT infra via VPN
- Use security groups to control access