Apache Cassandra Flashcards
What is Apache Cassandra?
A free, open-source, distributed NoSQL database. It is designed to handle large amounts of data on commodity hardware, and it is optimized for write-heavy workloads.
What is the shape of sharded clusters in Cassandra?
The nodes form a ring, and data is distributed around it using consistent hashing.
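A minimal sketch of a consistent-hashing ring, to show why adding a node only moves a small share of the keys. The node names, the `token` helper, and the use of MD5 are assumptions made for illustration; Cassandra's actual partitioner and its virtual nodes are not modelled here.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Each node sits at its own token; it owns the keys that fall
        # between the previous node's token and its own.
        self._entries = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, key: str) -> str:
        # First node clockwise from the key's token, wrapping around the ring.
        i = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._entries[i][1]


ring = Ring(["node-a", "node-b", "node-c"])
bigger = Ring(["node-a", "node-b", "node-c", "node-d"])

keys = [f"user-{i}" for i in range(1000)]
# Keys whose owner changed after node-d joined the ring.
moved = [k for k in keys if ring.owner(k) != bigger.owner(k)]
```

Every key in `moved` ends up on the new node; keys owned by the other nodes stay where they were, which is the property that makes growing the ring cheap.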
Cassandra data model
A column family is the way to store and organize data.
A table is a two-dimensional view of a multi-dimensional column family.
Operations on tables are performed using the Cassandra Query Language (CQL).
Although a column family has to be defined beforehand, its columns are not fixed and can be added at any time.
Cassandra Database Elements (components)
Cluster - Container of keyspaces
Keyspace - Corresponds to database
Column Family - Set of rows with similar structure
CQL Table - Tables in Cassandra Query Language
A key in Cassandra can consist of more than one column.
Command to create Keyspace in Cassandra
CREATE KEYSPACE ABC WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3} AND durable_writes = true;
NoSQL Database types
1) Key-value stores: Amazon DynamoDB, Voldemort, Citrusleaf, Membase, Riak, etc.
2) Document databases: MongoDB, Couch One, Terrastore, OrientDB
3) BigTable clones: BigTable (Google), Cassandra, HBase, Hypertable
4) Graph databases: FlockDB (Twitter), AllegroGraph, DEX, InfoGrid, Neo4j
What is Dynamo? And who wrote Dynamo White Paper?
The Dynamo white paper was written at Amazon. Dynamo is a highly available key-value store, and it was built to solve the shopping cart problem.
The paper focused on how to build a data store that is:
1) Reliable
2) Always On
3) Performant
It was not entirely new work, as it cited 24 other white papers.
What is BigTable? And who wrote the white paper?
BigTable is a high-volume, sequential-access datastore. The white paper was written at Google.
1) Richer data model
2) One key, lots of values
3) Fast sequential access
4) 38 papers cited.
What is Cassandra? And who wrote the Cassandra white paper?
It is a blend of the Dynamo paper and the BigTable paper.
It has the distributed features of Dynamo.
It has the data model and storage of BigTable.
The paper was written at Facebook.
Do Cassandra cluster nodes share anything?
No, Cassandra is based on a shared-nothing architecture.
Basics of Cassandra Replication
Cassandra is fully replicated.
Clients write to the local data center.
Data syncs across the WAN.
The replication factor is set per data center.
This is the differentiating factor between Cassandra and other databases.
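A rough sketch of what "replication factor per data center" means in practice: walk the ring clockwise from the key's position and keep picking nodes until each data center has its quota of replicas. The node names, data center names, and the `replicas` function are hypothetical, and rack awareness is ignored; this is not Cassandra's actual placement code.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas(key, nodes, rf_per_dc):
    """nodes: {node_name: dc_name}; rf_per_dc: {dc_name: replica_count}."""
    ring = sorted((token(name), name) for name in nodes)
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token(key))
    chosen = {dc: [] for dc in rf_per_dc}
    # Walk clockwise once around the ring, filling each data center's quota.
    for step in range(len(ring)):
        name = ring[(start + step) % len(ring)][1]
        dc = nodes[name]
        if dc in chosen and len(chosen[dc]) < rf_per_dc[dc]:
            chosen[dc].append(name)
    return chosen


# Hypothetical cluster: four nodes in "us", three nodes in "eu".
nodes = {
    "us-1": "us", "us-2": "us", "us-3": "us", "us-4": "us",
    "eu-1": "eu", "eu-2": "eu", "eu-3": "eu",
}
placement = replicas("pmcfadin", nodes, {"us": 3, "eu": 2})
```

Here `placement` holds three nodes from "us" and two from "eu" for the same key, so each data center keeps its own full set of copies.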
Is replication synchronous or asynchronous?
Asynchronous
Is there any elected master or leader in Cassandra?
No, it does not have any master, slave, or elected leader.
Explain briefly the write process of Cassandra on a single node.
1) The Cassandra client fires a query: UPDATE users SET firstname = 'Patrick' WHERE id = 'pmcfadin';
2) As soon as Cassandra receives it, the first thing it does is create a mutation, just like in Apache Thrift.
3) It first writes the mutation to the commit log. This ensures durability: once the data has been received by the server, you want the data to be there.
The commit log is append-only, so it is very fast. On a spinning disk, the spindle will not be doing random seeks. It will just go click click and write the data sequentially.
4) The mutation is then put into the memtable. The memtable is a structure based on a table, here users. Each row in the memtable is identified by its primary key and can have a very large number of columns attached to it (billions), an idea inspired by BigTable. The memtable represents rows of stored data, and it is kept in memory first.
5) Cassandra acknowledges the write to the client. The write path is simple, which is why it is so fast and why scaling is easier.
6) When the memtable starts filling up, its contents are flushed to disk. The memtable is written out to a file called an sstable (sorted string table). It is "IMMUTABLE". This flush is a sequential write, which is very different from the random seeking of relational databases. Cassandra does sequential IO, as opposed to the random IO of relational databases.
Because of the sequential write, order is preserved on disk. When you ask for the data, it comes out in a sequential read. That is why Cassandra is so good for time-series data: the access pattern is also going to be sequential.
7) What happens if the same row is updated twice? There is an old record and a new record, so the sstables now hold two records for that row. Cassandra looks for the latest timestamp, but the inefficiency is that it has to choose among multiple records. To solve that it runs compaction: it merge-sorts the records and then writes a new, compacted sstable.
So disk usage keeps going up and down, which is normal.
With Cassandra you are deferring IO to later, because of the requirement for fast reads and writes. So we change the order of IO.
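The steps above can be sketched as a toy single-node write path. Everything here (the `Node` class, the `memtable_limit` parameter, the counter used as a timestamp) is a hypothetical stand-in. It also simplifies in two ways: the flush runs inside `write`, and a row is replaced as a whole unit, whereas real Cassandra reconciles timestamps per column.

```python
import itertools


class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only list of mutations
        self.memtable = {}     # in memory: key -> (timestamp, columns)
        self.sstables = []     # each sstable is an immutable, key-sorted tuple
        self.memtable_limit = memtable_limit
        self._clock = itertools.count(1)  # stand-in for write timestamps

    def write(self, key, columns):
        ts = next(self._clock)
        self.commit_log.append((ts, key, columns))  # step 3: durability first
        self.memtable[key] = (ts, columns)          # step 4: update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                            # step 6, done inline here
        return "ack"                                # step 5: acknowledge

    def flush(self):
        # Write the memtable out in key order as one immutable sstable.
        rows = sorted(self.memtable.items(), key=lambda kv: kv[0])
        self.sstables.append(tuple(rows))
        self.memtable = {}

    def compact(self):
        # Step 7: merge all sstables, keeping the latest timestamp per key.
        merged = {}
        for table in self.sstables:
            for key, (ts, cols) in table:
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, cols)
        rows = sorted(merged.items(), key=lambda kv: kv[0])
        self.sstables = [tuple(rows)]


node = Node(memtable_limit=2)
node.write("pmcfadin", {"firstname": "Pat"})
node.write("jdoe", {"firstname": "Jane"})         # memtable full: first flush
node.write("pmcfadin", {"firstname": "Patrick"})  # same row updated again
node.write("asmith", {"firstname": "Al"})         # memtable full: second flush
sstables_before = len(node.sstables)
node.compact()
```

Before compaction the row for pmcfadin exists in two sstables; afterwards a single sstable remains and only the record with the latest timestamp survives.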
What is sstable?
It is a sorted string table, where records are stored in sorted key order, and it is IMMUTABLE. Once written, its contents can never be changed. So you get sequential writes on disk, as opposed to relational database models, where there are random writes and reads.
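A small sketch of why a sorted, immutable table suits sequential reads: a range of keys is always one contiguous slice. The sensor names, readings, and the `scan` helper are made up for illustration, and a Python tuple stands in for the on-disk file.

```python
import bisect

# Hypothetical time-series rows keyed by (sensor, timestamp), in arrival order.
rows = {
    ("sensor-1", 3): 21.5,
    ("sensor-1", 1): 20.9,
    ("sensor-2", 1): 18.2,
    ("sensor-1", 2): 21.1,
}

# "Flush": sort once by key and freeze the result.
sstable = tuple(sorted(rows.items()))
keys = [k for k, _ in sstable]


def scan(lo, hi):
    """Return the rows with lo <= key < hi as one contiguous slice."""
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_left(keys, hi)
    return sstable[start:end]


# All readings for sensor-1, already in time order.
readings = scan(("sensor-1", 0), ("sensor-1", 10**9))
```

Because the tuple cannot be modified in place, an update has to go into a newer table rather than overwrite this one, which is the situation compaction later cleans up.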