Cassandra Flashcards

(32 cards)

1
Q

KeySpace

A

A keyspace is the outermost container for data in Cassandra. Its basic attributes are:
Replication factor − The number of nodes in the cluster that receive a copy of the same data.
Replica placement strategy − The strategy used to place replicas in the ring, for example SimpleStrategy or NetworkTopologyStrategy.
Column families − A keyspace is a container for one or more column families. A column family, in turn, is a container for a collection of rows, and each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
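As a rough illustration of how the replication factor and ring placement interact, the Python sketch below puts each key on a toy ring and picks the next nodes clockwise, which mirrors the idea behind SimpleStrategy. The node names, token values, and the MD5-based `token_for` helper are assumptions made for illustration; Cassandra's own partitioner and token allocation work differently.

```python
import bisect
import hashlib

# Hypothetical four-node ring: (token, node name), sorted by token.
RING = [(0, "node_a"), (64, "node_b"), (128, "node_c"), (192, "node_d")]


def token_for(partition_key: str) -> int:
    """Map a key to a token in 0..255 (a stand-in for a real partitioner)."""
    return hashlib.md5(partition_key.encode("utf-8")).digest()[0]


def replicas_for(partition_key: str, replication_factor: int) -> list[str]:
    """Walk the ring clockwise from the key's token and collect distinct nodes."""
    tokens = [token for token, _ in RING]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token_for(partition_key)) % len(RING)
    # A toy ring cannot hold more replicas than it has nodes.
    count = min(replication_factor, len(RING))
    return [RING[(start + i) % len(RING)][1] for i in range(count)]


print(replicas_for("user-42", 3))
```

With a replication factor of 3 on this four-node ring, every key lands on three distinct nodes, so losing one node still leaves two copies.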

https://www.tutorialspoint.com/cassandra/images/keyspace.jpg

2
Q

Link

A

https://www.tutorialspoint.com/cassandra/cassandra_data_model.htm

3
Q

KeySpace creation

A

CREATE KEYSPACE keyspace_name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

4
Q

Column Family

A

A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.
In the original Cassandra data model, the column families are defined but the columns are not, so you can add any column to any column family at any time. With CQL tables, columns are declared in the schema and new ones are added with ALTER TABLE.
Unlike relational tables, where every row follows a fixed schema, Cassandra does not force individual rows to have all the columns.
https://www.tutorialspoint.com/cassandra/images/cassandra_column_family.jpg

5
Q

Column Family (or Table)

A

Tables store data in rows and columns, but unlike relational databases, each row can populate a different subset of the columns.
Cassandra does not support joins and does not enforce foreign keys.
Each row must have a PRIMARY KEY (partition key + optional clustering columns).

CREATE TABLE users (
user_id UUID PRIMARY KEY,
name TEXT,
email TEXT,
created_at TIMESTAMP
);

6
Q

Partitioning Key or Clustering Key

A

Partition Key (Required) → Determines which node stores the data.
Clustering Columns (Optional) → Defines row sorting within a partition.

CREATE TABLE orders (
user_id UUID, -- Partition key (distributes data across nodes)
order_id UUID, -- Clustering column (orders sorted by order_id)
item TEXT,
price DECIMAL,
order_date TIMESTAMP,
PRIMARY KEY (user_id, order_id) -- Compound primary key
);

Partition Key (user_id) ensures all orders of a user are stored together.
Clustering Column (order_id) sorts orders within each user’s partition.
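
To make the two roles concrete, here is a small Python sketch that treats a table as a dictionary of partitions, each holding rows kept in clustering order. It is only a mental model that assumes an in-memory store; the `insert_order` and `orders_for_user` helpers are hypothetical and say nothing about Cassandra's storage engine.

```python
import bisect
from collections import defaultdict

# partition key (user_id) -> list of (order_id, item) kept sorted by order_id
orders = defaultdict(list)


def insert_order(user_id: str, order_id: str, item: str) -> None:
    """Add a row to the user's partition, keeping clustering order."""
    bisect.insort(orders[user_id], (order_id, item))


def orders_for_user(user_id: str) -> list[tuple[str, str]]:
    """Read one partition; rows come back sorted by the clustering column."""
    return list(orders.get(user_id, []))


insert_order("u1", "o3", "lamp")
insert_order("u1", "o1", "desk")
insert_order("u2", "o2", "chair")
print(orders_for_user("u1"))  # [('o1', 'desk'), ('o3', 'lamp')]
```

The partition key picks the bucket, and the clustering column decides the order of rows inside it.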

7
Q

Denormalization and Data Duplication

A

Cassandra denormalizes data instead of using joins. Data is modeled for queries, not for normalization.
CREATE TABLE user_orders (
user_id UUID,
order_id UUID,
product_id UUID,
product_name TEXT,
quantity INT,
price DECIMAL,
order_date TIMESTAMP,
PRIMARY KEY ((user_id), order_id, product_id) -- Partitioned by user_id
);
Orders are partitioned by user_id (ensuring all of a user's orders are stored together).
Data redundancy helps eliminate joins, improving query speed.

8
Q

Primary Key

A

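Simple Primary Key - PRIMARY KEY (user_id) - A single column that serves as the partition key, so each partition holds one row.
Compound Primary Key - PRIMARY KEY (user_id, order_id) - Uses a partition key + clustering column, allowing multiple rows per partition.
Compound Primary Key with multiple clustering columns - PRIMARY KEY ((user_id), order_id, product_id) - Uses partition key (user_id) + multiple clustering columns (order_id, product_id) to organize data within partitions.
Composite Partition Key - PRIMARY KEY ((hotel_id, start_date), room_number) - Uses several columns together (hotel_id, start_date) as the partition key.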

9
Q

Super Column

A

Super Columns are now deprecated, and the preferred way to model hierarchical data in Cassandra is by using tables with composite primary keys.

10
Q

Query Patterns

A

Use a query-first approach to start designing the data model for an application: list the queries the application needs, then design tables around them. For example:
Q1. Find hotels near a given point of interest.

Q2. Find information about a given hotel, such as its name and location.

Q3. Find points of interest near a given hotel.

To name each table, you’ll identify the primary entity type for which you are querying and use that to start the entity name. If you are querying by attributes of other related entities, append those to the table name, separated with _by_. For example, hotels_by_poi.

11
Q

Artem Chebotko

A

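Creator of the Chebotko diagram, a notation for representing a Cassandra data model (tables, their key columns, and the queries they support).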

12
Q

Wide Partition Pattern/ Wide Row pattern

A

The essence of the pattern is to group multiple related rows in a partition in order to support fast access to multiple rows within the partition in a single query.

Basically all data related to the partition is in multiple rows in the same partition.

Let’s now consider how to support query Q4, which helps the user find available rooms at a selected hotel for the nights they are interested in staying. Note that this query involves both a start date and an end date. Because you’re querying over a range instead of a single date, you know that you’ll need to use the date as a clustering key. Use the hotel_id as the partition key to group room data for each hotel on a single partition, which should keep searches fast. Let’s call this the available_rooms_by_hotel_date table.
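
The range lookup described here can be pictured in Python as a binary search over date-sorted rows inside one hotel's partition. This is a toy model: the `available_rooms` dictionary, the sample hotel ID, and the `rooms_between` helper are made up for illustration.

```python
import bisect
from datetime import date

# One partition per hotel_id; rows sorted by the clustering key (date, room_number).
available_rooms = {
    "AZ123": [
        (date(2024, 1, 5), 101),
        (date(2024, 1, 6), 101),
        (date(2024, 1, 6), 102),
        (date(2024, 1, 8), 101),
    ],
}


def rooms_between(hotel_id: str, start: date, end: date) -> list[tuple[date, int]]:
    """Return rows with start <= date <= end from a single partition."""
    rows = available_rooms.get(hotel_id, [])
    dates = [row_date for row_date, _ in rows]
    # Because rows are sorted by date, the matching range is one contiguous slice.
    lo = bisect.bisect_left(dates, start)
    hi = bisect.bisect_right(dates, end)
    return rows[lo:hi]


print(rooms_between("AZ123", date(2024, 1, 6), date(2024, 1, 7)))
```

Because the rows are already ordered by the clustering key, a date range maps to one contiguous slice of the partition, which is why the date belongs in the clustering key.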

13
Q

Time Series Pattern

A

The time series pattern is an extension of the wide partition pattern. In this pattern, a series of measurements at specific time intervals are stored in a wide partition, where the measurement time is used as part of the partition key. This pattern is frequently used in domains including business analysis, sensor data management, and scientific experiments.
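
One common way to keep time series partitions from growing without bound is to put a coarse time bucket, derived from the measurement time, into the partition key. The Python sketch below derives such a key; the sensor naming, the one-day bucket size, and the `partition_key_for` and `record` helpers are assumptions for illustration, not something the pattern prescribes.

```python
from collections import defaultdict
from datetime import datetime

# (sensor_id, day bucket) -> list of (measurement time, value)
readings = defaultdict(list)


def partition_key_for(sensor_id: str, measured_at: datetime) -> tuple[str, str]:
    """Combine the sensor with a one-day bucket taken from the measurement time."""
    return (sensor_id, measured_at.strftime("%Y-%m-%d"))


def record(sensor_id: str, measured_at: datetime, value: float) -> None:
    """Append a measurement to its partition and keep rows in time order."""
    rows = readings[partition_key_for(sensor_id, measured_at)]
    rows.append((measured_at, value))
    rows.sort()


record("s1", datetime(2024, 3, 1, 12, 0), 21.5)
record("s1", datetime(2024, 3, 1, 9, 0), 19.0)
record("s1", datetime(2024, 3, 2, 9, 0), 18.2)
print(sorted(readings))  # [('s1', '2024-03-01'), ('s1', '2024-03-02')]
```

Each sensor gets one partition per day, and the measurements inside a partition stay in time order.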

14
Q

Use Cases

A

Cassandra is well-suited for use cases requiring high availability, fault tolerance, and linear scalability, such as real-time analytics, IoT data management, and content delivery systems. Its decentralized architecture and support for multi-data center replication make it ideal for applications requiring continuous availability and resilience to hardware failures.

15
Q

Cassandra Schema

A

CREATE KEYSPACE reservation WITH replication = {'class':
'SimpleStrategy', 'replication_factor' : 3};

CREATE TYPE reservation.address (
street text,
city text,
state_or_province text,
postal_code text,
country text );

CREATE TABLE reservation.reservations_by_confirmation (
confirm_number text,
hotel_id text,
start_date date,
end_date date,
room_number smallint,
guest_id uuid,
PRIMARY KEY (confirm_number) )
WITH comment = 'Q6. Find reservations by confirmation number';

CREATE TABLE reservation.reservations_by_hotel_date (
hotel_id text,
start_date date,
end_date date,
room_number smallint,
confirm_number text,
guest_id uuid,
PRIMARY KEY ((hotel_id, start_date), room_number) )
WITH comment = 'Q7. Find reservations by hotel and date';

CREATE TABLE reservation.reservations_by_guest (
guest_last_name text,
hotel_id text,
start_date date,
end_date date,
room_number smallint,
confirm_number text,
guest_id uuid,
PRIMARY KEY ((guest_last_name), hotel_id) )
WITH comment = 'Q8. Find reservations by guest name';

CREATE TABLE reservation.guests (
guest_id uuid PRIMARY KEY,
first_name text,
last_name text,
title text,
emails set<text>,
phone_numbers list<text>,
addresses map<text, frozen<address>>,
confirm_number text )
WITH comment = 'Q9. Find guest by ID';

16
Q

Built In Data Type

A

ascii, bigint, blob, boolean, counter, timestamp, varchar, and others such as int, text, uuid, and decimal

17
Q

Built in Data Types - Collections

A

list, map, set
Collections are meant for small amounts of data; the commonly quoted limit is 64 KB per collection item.

18
Q

Set

A

CREATE TABLE University.Teacher
(
id int,
Name text,
Email set<text>,
PRIMARY KEY (id)
);

INSERT INTO University.Teacher (id, Name, Email) VALUES (1, 'Guru99', {'abc@gmail.com', 'xyz@hotmail.com'});

19
Q

User Defined Type

A

A User Defined Type (UDT) lets you attach multiple data fields to a single column. A UDT groups related fields (field 1, field 2, etc.) together, and each field has a name and a type.

20
Q

UDT Syntax

A

Syntax to define a UDT:
CREATE TYPE udt_name
(
field_name_1 data_type_1,
field_name_2 data_type_2,
field_name_3 data_type_3,
field_name_4 data_type_4,
field_name_5 data_type_5
);

21
Q

UDT Example

A

CREATE TYPE Emp.Current_add
(
Emp_id int,
h_no text,
city text,
state text,
pin_code int,
country text
);
CREATE TABLE Emp.Registration
(
Emp_id int PRIMARY KEY,
Emp_Name text,
current_address FROZEN<Current_add>
);
INSERT INTO Emp.Registration (Emp_id, Emp_Name, current_address)
VALUES (1000, 'Ashish', { h_no : 'A 210', city : 'delhi', state : 'DL', pin_code : 12345, country : 'IN' });

22
Q

FROZEN keyword

A

A column whose type is a frozen collection (set, map, or list) can only have its value replaced as a whole. In other words, we can’t add, update, or delete individual elements from the collection as we can in non-frozen collection types.
Thanks to freezing, we can use a frozen collection as the primary key in a table.
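
Python's built-in `set` and `frozenset` offer a loose analogy for the difference. The sketch below is only an analogy and not how Cassandra implements frozen collections: a `frozenset` has to be replaced as a whole and can serve as a dictionary key, while a mutable `set` cannot. The variable names are made up for illustration.

```python
# A mutable set can be changed element by element, like a non-frozen collection.
emails = {"abc@gmail.com"}
emails.add("xyz@hotmail.com")

# A frozenset has no add/remove methods; changing it means replacing the whole value.
frozen_emails = frozenset({"abc@gmail.com"})
frozen_emails = frozenset({"abc@gmail.com", "xyz@hotmail.com"})

# Being immutable, a frozenset is hashable and can act as a lookup key.
teachers_by_emails = {frozen_emails: "Guru99"}

# A plain set is unhashable, so using it as a key raises TypeError.
try:
    {emails: "Guru99"}
    set_usable_as_key = True
except TypeError:
    set_usable_as_key = False

print(set_usable_as_key)  # False
```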

23
Q

Naming Conventions

A

Keyspace - Lowercase, underscore-separated names. Ex: ecommerce_app, user_profiles
Table - Lowercase, plural nouns, underscore-separated. Ex: users, user_orders, product_inventory
Column - Descriptive and lowercase. Ex: signup_date
Primary Key - ID-based, clear names. Ex: user_id, post_id
Materialized View - X_by_Y pattern. Ex: orders_by_user

24
Q

X_by_Y naming pattern for Tables

A

When to use x_by_y:
The table is designed for a specific query pattern. Since Cassandra is query-driven, you often design denormalized tables optimized for a particular access pattern. Naming them as x_by_y (e.g., orders_by_customer) helps clarify what the table is for.
Example: orders_by_user, for querying all orders for a given user.
When not to use x_by_y:
The table is the canonical (master) version of the entity (e.g., users, products).

25
Q

X_by_Y interpretation

A

"This table gives me X (data) grouped/indexed by Y (key or dimension)."
The X in x_by_y is not the primary key; it is usually the type of data stored in the table.
The Y represents the query key (i.e., how you are accessing or grouping that data), which is often part of the primary key (partition or clustering key).
Example: orders_by_user, with columns user_id, order_id, order_date, total_amount. Here user_id is the partition key and order_id is the clustering key.
What X is: orders → the data
What Y is: user_id → the key used to fetch it

26
Q

Selecting by Clustering Key

A

In Apache Cassandra, you cannot by default select data by the clustering key alone; you must include the full partition key in your WHERE clause when querying.

Partition key is mandatory in the WHERE clause:
Allowed - SELECT * FROM orders_by_user WHERE user_id = '123' AND order_id = 'abc';
Not allowed - SELECT * FROM orders_by_user WHERE order_id = 'abc';

Why this restriction?
Finding data by a clustering key alone would require scanning all partitions, which violates Cassandra's scalability principles.

Workaround: create a denormalized table like:
CREATE TABLE orders_by_id (
order_id text PRIMARY KEY,
user_id text,
...
);
This lets you do:
SELECT * FROM orders_by_id WHERE order_id = 'abc';
This is the preferred way in Cassandra: model your tables around your queries.

27
Q

Manual Denormalization

A

In Cassandra, you must manually populate all the relevant tables when a new order is received. Cassandra does not support joins or auto-population across tables, so if you have multiple denormalized tables (which is standard practice), your application is responsible for writing the same data to all necessary tables. This is known as manual denormalization.
Let's say the backend receives a new order. You must insert into the following tables:
orders_by_user
orders_by_id
order_items_by_order
orders_by_restaurant

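
A minimal Python sketch of that responsibility, using plain dictionaries as stand-ins for two of the tables. The `save_order` helper and the sample values are hypothetical; a real application would issue the equivalent INSERT statements through a Cassandra driver.

```python
# In-memory stand-ins for two denormalized tables.
orders_by_user = {}  # user_id -> {order_id: order}
orders_by_id = {}    # order_id -> order


def save_order(order: dict) -> None:
    """Write the same order to every table that serves a query."""
    orders_by_user.setdefault(order["user_id"], {})[order["order_id"]] = order
    orders_by_id[order["order_id"]] = order


save_order({"user_id": "u1", "order_id": "abc", "total_amount": 25})
save_order({"user_id": "u1", "order_id": "def", "total_amount": 40})

# Each query reads the table that was designed for it.
print(sorted(orders_by_user["u1"]))         # ['abc', 'def']
print(orders_by_id["abc"]["total_amount"])  # 25
```

If the application forgets one of the writes, the tables drift apart, which is why all the writes for one order usually live in a single code path.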
28
Q

Manual Denormalization - Transaction Safety

A

Wrap all inserts in a logged batch so that either all of them are eventually applied or none are. Isolation is only guaranteed within a single partition.
BEGIN BATCH
INSERT INTO orders_by_user (...) VALUES (...);
INSERT INTO orders_by_id (...) VALUES (...);
INSERT INTO order_items_by_order (...) VALUES (...);
INSERT INTO order_items_by_order (...) VALUES (...);
INSERT INTO order_items_by_order (...) VALUES (...);
INSERT INTO orders_by_restaurant (...) VALUES (...);
APPLY BATCH;

29
Q

UUID in Cassandra

A

In Cassandra, UUID is a built-in data type used to store Universally Unique Identifiers, commonly used for:
Primary keys
Identifiers for rows or objects (e.g., user_id, order_id, checkout_id)
The uuid() function generates a random UUID.
CREATE TABLE users (
user_id uuid PRIMARY KEY,
name text,
email text
);
INSERT INTO users (user_id, name, email) VALUES (uuid(), 'John Doe', 'john@example.com');

30
Q

Partition Key Uniqueness

A

No, the partition key in Cassandra does not have to be unique. Instead, the combination of the partition key and the clustering key(s) must be unique, and together they form the primary key.

31
Q

Primary Key Uniqueness

A

Primary Key = (Partition Key) + Clustering Key(s)
The entire primary key must be unique.
What happens if you try to insert a row with the same full primary key? The row is overwritten, not duplicated. Cassandra performs an upsert:
INSERT INTO orders_by_user (...) VALUES (...); -- Upserts if PK matches

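
The upsert behaviour can be mimicked in Python with a dictionary keyed by the full primary key. This is a toy model under the assumption that a row is just a dictionary of column values; the `insert` helper and the sample columns are hypothetical.

```python
# Table modeled as a dict keyed by the full primary key (user_id, order_id).
orders_by_user = {}


def insert(user_id: str, order_id: str, **columns) -> None:
    """Upsert: create the row if missing, otherwise overwrite the given columns."""
    orders_by_user.setdefault((user_id, order_id), {}).update(columns)


insert("u1", "abc", total_amount=25, status="new")
insert("u1", "abc", total_amount=30)  # same primary key: updates the existing row
insert("u1", "def", total_amount=40)  # same partition key, new clustering key: new row

print(len(orders_by_user))            # 2
print(orders_by_user[("u1", "abc")])  # {'total_amount': 30, 'status': 'new'}
```

The second insert does not create a duplicate; it changes the value stored under the same key, while the third insert adds a second row to the same partition.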
32
Q

Partition Key as Primary Key

A

You can mark just the partition key as the primary key, but then you cannot have multiple rows for that partition key unless you add a clustering key.
CREATE TABLE user_profiles (
user_id UUID,
name text,
email text,
PRIMARY KEY (user_id) -- partition key only
);
This is valid and means the partition key is the entire primary key. It is the same as this syntax:
CREATE TABLE user_profiles (
user_id UUID PRIMARY KEY,
name text,
email text
);