CIS275 - Chapter 9: NoSQL Databases Flashcards
Structured data created within an organization, with sizes ranging from gigabytes to terabytes, is called _____.
transactional data
Data generated by new internet and multimedia applications is commonly called ____ and differs from transactional data.

big data
Big data differs from transactional data in four ways:
Volume. Typical size ranges from terabytes to petabytes (million billion bytes), occasionally reaching exabytes (billion billion bytes).
Velocity. Big data is generated at extremely high rates. Facebook users upload roughly a billion photos per day, or more than 10,000 per second. Twitter generates roughly 6,000 tweets per second. Click rates on popular websites can be significantly higher.
Variety. Variety means both unstructured and rapidly changing data types. Unstructured data refers to information embedded in complex data types like images, video, GPS coordinates, and natural language. Rapidly changing data means the information content of records varies greatly, as in data collected from social media. Both unstructured and rapidly changing data are common in big data.
Veracity. Transactional data is typically created by an organization’s employees or trusted partners. Big data is often generated by the general public. Consequently, the accuracy of big data varies much more than transactional data.
- Variety refers to unstructured data, such as text files, video, web logs, social media, and sensor data.
- Variety also refers to variable data structures. Ex: Facebook, LinkedIn, and Twitter contain different information about people, which might be combined in big data.



_____ increases capacity by increasing speed and size of CPUs and storage devices for a limited number of machines.

Vertical scaling
To accommodate increasing database sizes, transactional applications commonly scale vertically, not horizontally.
Vertical scaling increases processing speed, memory, and storage of a limited number of machines.

_____ increases capacity by adding large numbers of low-cost components like standard disk drives and CPUs.

Horizontal scaling
To accommodate increasing database sizes, transactional applications commonly scale vertically, not horizontally.
Horizontal scaling adds an unlimited number of machines working in parallel.

_____ splits large tables into separate physical files on one machine.
Partitioning
Relational databases were developed prior to big data. Historically, which of the following requirements were prioritized by relational databases?


_____ splits data sets across multiple machines.
Sharding


Data is represented as a single key and an associated value. The key is used to access the value.
Key-value database

Key-value systems support a limited set of queries, such as:
put(key, value) - Stores the value in the database, indexed by key.
get(key) - Retrieves the value associated with the key.
multiGet(key1, …, keyn) - Retrieves the values associated with keys 1 through n.
delete(key) - Deletes the value associated with the key.
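The four operations above can be sketched with an in-memory dictionary. This is a minimal illustration, not any particular product's API: the class name KeyValueStore and the method name multi_get are hypothetical, and the keys and values are taken from the examples later in this set.

```python
from typing import Dict, List, Optional


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical class, for illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        # Store the value under the key, replacing any existing value.
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        # Return the value for the key, or None if the key is absent.
        return self._data.get(key)

    def multi_get(self, *keys: str) -> List[Optional[str]]:
        # Return the values for several keys, in the order requested.
        return [self._data.get(key) for key in keys]

    def delete(self, key: str) -> None:
        # Remove the key and its value; do nothing if the key is absent.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("mike@email.com", "/User/pics/flower3.jpg")
store.put("joe@email.com", "/User/pics/cat4.jpg")
print(store.get("mike@email.com"))
print(store.multi_get("mike@email.com", "joe@email.com"))
```

Note that there is no query by value: every operation starts from a key, which is why the key must be unique.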
Also called column-based, column family, or tabular database. Data is represented as a key and multiple values. Since each record has multiple values, a descriptive name is stored with each value.

Wide column database

Data is represented as a key and a ‘document’. Usually the document is in a structured, human-readable format such as XML or JSON.
Document database

Data is represented as a graph with nodes and edges.
Graph database

_____ supports the data models of several categories.
multi-model database, also called a hybrid database


Keys are used to identify and locate values. In this example, the key is an email address.
Values are photographs of the person associated with the email address.
Each key is associated with one value.

Key-value logical structure

The put() function stores a value in the database.
The get() function retrieves the value associated with a key.


Values are grouped in hash buckets.
Values are replicated on multiple devices for high availability and fast access.

Key-value physical structure

Updates to values are applied to one replica. For fast updates and high availability, additional replicas are not updated within a transaction.
If other replicas are accessed before an update is propagated, obsolete values are returned.
Eventually, the update is propagated to all replicas.
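The update behavior described above can be simulated in a few lines. This is a rough sketch under stated assumptions: the class ReplicatedStore, its propagate() method, and the three-replica default are all hypothetical, and real systems propagate updates in the background rather than on an explicit call.

```python
from typing import Dict, List, Optional, Tuple


class ReplicatedStore:
    """Hypothetical model of a value replicated on several devices."""

    def __init__(self, replica_count: int = 3) -> None:
        self.replicas: List[Dict[str, str]] = [{} for _ in range(replica_count)]
        self.pending: List[Tuple[str, str]] = []  # updates not yet propagated

    def put(self, key: str, value: str) -> None:
        # The update is applied to one replica only, so the write returns quickly.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def get(self, key: str, replica: int) -> Optional[str]:
        # A read may be served by any replica, including one that is out of date.
        return self.replicas[replica].get(key)

    def propagate(self) -> None:
        # Eventually, every pending update reaches the remaining replicas.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


db = ReplicatedStore()
db.put("mike@email.com", "/User/pics/flower3.jpg")
db.propagate()
db.put("mike@email.com", "/User/pics/cat4.jpg")
stale = db.get("mike@email.com", replica=1)   # read before propagation
db.propagate()
fresh = db.get("mike@email.com", replica=1)   # read after propagation
print(stale, fresh)
```

The read taken before propagate() returns the obsolete value, and the read taken afterwards returns the updated one.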





Expected: Webpage, User
The key must be unique. A webpage domain name and user email are unique, whereas two students may have the same age.

Expected: Building, Stock
The key must be unique. A building’s street address and stock symbol are unique, whereas two students may have the same age.

Expected: Stock, Employee
The key must be unique. A stock symbol and employee email are unique, whereas two patients may have the same full name.

Expected: Webpage
The key must be unique. A webpage domain name is unique, whereas two students may have the same age, or two employees may have the same title.

Expected: ‘/User/pics/flower3.jpg’
get(‘mike@email.com’) retrieves the value associated with the key ‘mike@email.com’, which is ‘/User/pics/flower3.jpg’.

Expected: New value added, Existing value replaced
‘joe@email.com’ is not a key, so put(‘joe@email.com’, ‘/User/pics/cat4.jpg’) adds a new key ‘joe@email.com’ with value ‘/User/pics/cat4.jpg’.
‘matt@email.com’ is already a key, so put(‘matt@email.com’, ‘/User/pics/puppy7.jpg’) replaces the existing value for key ‘matt@email.com’ with ‘/User/pics/puppy7.jpg’.
Wide column databases store multiple versions of each value. Each version is marked with the date and time the version is created, called a _____.
timestamp

To access older values, the timestamp must be specified in a query. If a query does not specify a timestamp, the database selects the most recent version.
In a wide column database, a specific value is accessed with a combination of table name, key, column family name, column name, and optional timestamp.
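The addressing scheme above (table name, key, column family name, column name, optional timestamp) can be sketched as follows. This is a logical model only: the class WideColumnStore is hypothetical, timestamps are simplified to integers, and the status values are made up for illustration. It says nothing about how column families are physically stored.

```python
from typing import Dict, List, Optional, Tuple

# (table name, key, column family name, column name) identifies one cell.
CellAddress = Tuple[str, str, str, str]


class WideColumnStore:
    """Hypothetical model of versioned cells in a wide column database."""

    def __init__(self) -> None:
        # Each cell holds a list of (timestamp, value) versions.
        self._cells: Dict[CellAddress, List[Tuple[int, str]]] = {}

    def put(self, table: str, key: str, family: str, column: str,
            value: str, timestamp: int) -> None:
        versions = self._cells.setdefault((table, key, family, column), [])
        versions.append((timestamp, value))
        versions.sort()  # keep versions ordered from oldest to newest

    def get(self, table: str, key: str, family: str, column: str,
            timestamp: Optional[int] = None) -> Optional[str]:
        versions = self._cells.get((table, key, family, column), [])
        if not versions:
            return None
        if timestamp is None:
            return versions[-1][1]  # no timestamp given: most recent version
        for version_time, value in versions:
            if version_time == timestamp:
                return value
        return None


contacts = WideColumnStore()
contacts.put("Contact", "ajf@acm.org", "Description", "Status", "pending", timestamp=1)
contacts.put("Contact", "ajf@acm.org", "Description", "Status", "active", timestamp=2)
latest = contacts.get("Contact", "ajf@acm.org", "Description", "Status")
older = contacts.get("Contact", "ajf@acm.org", "Description", "Status", timestamp=1)
print(latest, older)
```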




All columns of a family are stored together for fast access via the key.
Different column families are physically separated.

Wide column databases are not optimized to access multiple column families within one query.






Expected:
Table name: Contact
Key: ajf@acm.org
Column family name: Description
Column name: Status
The table ‘Contact’ contains rows indexed by a key. The table has column families ‘Name’, ‘Address’, and ‘Description’. The columns in each column family can vary from one row to another.
So, the ‘Status’ of ‘Arnold J. Fourier’ is found by looking in the ‘Status’ column of the ‘Description’ column family of the row indexed by the key ‘ajf@acm.org’.

Expected:
Table name: Contact
Key: ajf@acm.org
Column family name: Address
Column name: State
The table ‘Contact’ contains rows indexed by a key. The table has column families ‘Name’, ‘Address’, and ‘Description’. The columns in each column family can vary from one row to another.
So, the ‘State’ of ‘Arnold J. Fourier’ is found by looking in the ‘State’ column of the ‘Address’ column family of the row indexed by the key ‘ajf@acm.org’.

Expected: manufacturing
First, the row with key ‘sales@corp.com’ is located. Then, the column family ‘Description’ is accessed. Finally, the value of the column ‘Category’ within the column family ‘Description’, which is ‘manufacturing’, is accessed.
A _____ stores data as a collection of documents.
document database
A document database may contain multiple collections, just as a relational database may contain multiple tables.
- The Flight collection consists of documents describing scheduled airline flights, in JSON format.
- Documents may have a different number of values with different names.
- Usually all documents in a collection share common value names, to facilitate queries.

Flight
{
  identifier: "cb20896a-eea8-b55c-7a22-08d885640c96",
  FlightNumber: "8809",
  Airline: "United",
  DepartureAirportCode: "JFK",
  ArrivalAirportCode: "ATL"
}
{
  identifier: "bha5678cdbd9e3a587de9b814578dba1",
  FlightNumber: "44",
  Airline: "American",
  DepartureAirportCode: "OAK",
  ArrivalAirportCode: "DFW"
}
{
  identifier: "41b3b38cdbd9e3a587de9b8145111aab",
  FlightNumber: "239",
  Airline: "United",
  DepartureAirportCode: "SFO",
  ArrivalAirportCode: "ORD"
}


Documents are assigned to a shard based on a _____.
shard key
The shard key is either the document identifier or some other value. If the shard key is a value, an index of shard key values is created so the database can quickly locate documents.
With a _____, each shard contains a contiguous range of shard key values.
range function
Ex: If the shard key for the Flight collection is Airline, documents for airlines beginning with ‘A’ might be in one shard, ‘B’ in another, and so on.
- The database designer selects either the identifier or an indexed value as the shard key. Airline is chosen as the shard key.
- Documents can be assigned to a shard based on a hash function on the shard key.

- Alternatively, documents can be assigned to a shard based on a range function.
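The two ways of assigning a document to a shard can be compared in a short sketch. The function names hash_shard and range_shard, the shard count of 4, and the choice of SHA-256 as the hash are all assumptions made for illustration. The range function follows the example above, with one shard per initial letter of the Airline shard key.

```python
import hashlib

NUM_SHARDS = 4  # assumed number of shards for the hash function


def hash_shard(shard_key: str, num_shards: int = NUM_SHARDS) -> int:
    # Hash function: spreads shard key values evenly, with no regard to order.
    digest = hashlib.sha256(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def range_shard(shard_key: str) -> int:
    # Range function: each shard holds a contiguous range of shard key values.
    # Here 'A...' goes to shard 0, 'B...' to shard 1, and so on.
    return ord(shard_key[0].upper()) - ord("A")


for airline in ["American", "United"]:
    print(airline, hash_shard(airline), range_shard(airline))
```

With the range function, all airlines starting with the same letter land on the same shard, which helps range queries but can leave shards unevenly loaded. The hash function trades that ordering for a more even spread.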






Expected: No documents are selected
No documents have the key Credits.




a hub where network lines converge.
A vertex, also called a node

a connection between two vertices.
An edge, also called a link

descriptive information associated with vertices and edges.
property
In a _____, edges have a starting and ending vertex and are depicted as arrows.
directed graph

In an _____, edges have no direction and are depicted as lines.
undirected graph





- Vertex labels are collections of objects, like entity types or tables.
- A vertex is an individual object, like an entity instance or table row. An edge is a relationship between individual objects.
- Properties are name-value pairs for vertices and edges.

- Property graphs have flexible schema. Different vertices and edges can have different properties.



- g.addV().property() adds a vertex with label ‘Passenger’ to graph g.
- g.addV().property() adds a vertex with label ‘Flight’ to graph g.

- g.V().addE().to() adds an edge between two vertices.
- out() traverses edges from start to end vertex, like a relational join.
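The operations above can be mimicked with a small in-memory property graph. This is a sketch and not the Gremlin API: the classes Vertex and PropertyGraph and the methods add_vertex, add_edge, and out are hypothetical stand-ins for addV(), addE(), and out(), and the property name Name is assumed. The passenger and flight values come from the worked example later in this set.

```python
from typing import Any, Dict, List, Tuple


class Vertex:
    """An individual object with a label and name-value properties."""

    def __init__(self, label: str, **properties: Any) -> None:
        self.label = label
        self.properties: Dict[str, Any] = dict(properties)
        # Outgoing edges are kept with the start vertex as (edge label, end vertex).
        self.out_edges: List[Tuple[str, "Vertex"]] = []


class PropertyGraph:
    """Hypothetical directed property graph."""

    def __init__(self) -> None:
        self.vertices: List[Vertex] = []

    def add_vertex(self, label: str, **properties: Any) -> Vertex:
        vertex = Vertex(label, **properties)
        self.vertices.append(vertex)
        return vertex

    def add_edge(self, start: Vertex, edge_label: str, end: Vertex) -> None:
        # Directed edge from the start vertex to the end vertex.
        start.out_edges.append((edge_label, end))

    def out(self, start: Vertex, edge_label: str) -> List[Vertex]:
        # Follow edges with the given label from the start vertex to their end vertices.
        return [end for label, end in start.out_edges if label == edge_label]


g = PropertyGraph()
gus = g.add_vertex("Passenger", Name="Gus King")
flight = g.add_vertex("Flight", FlightNumber="3416", AirlineName="Delta")
g.add_edge(gus, "Books", flight)
booked = g.out(gus, "Books")
print([v.properties for v in booked])
```

Because the edge is directed from Passenger to Flight, calling out() on the flight vertex with the same edge label returns nothing.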



- In a relational database, a relationship is stored as a foreign key value in an index, along with a pointer to the location of the row containing related data.
- With index-free adjacency, a pointer is stored within the start vertex. Queries that traverse edges require fewer reads.

- A pointer is also stored within the end vertex to enable traversal in any direction.






Expected: Jen Choi, $266
Person is an Entity type, so a vertex label. Same for Payment.
Jen Choi is a Person instance, so a vertex.
$266 is a Payment instance, so a vertex.
Kim Soto-Makes-$66 is a connection between two vertices, so an edge.

Expected:
Jan West-Makes-$142
Payment is an Entity type, so a vertex label, not an edge.
Fay Choi-Pays-$152 could not be an edge as Pays is not a relationship type shown in the graph.
The graph is directed and the arrow points from Person to Payment, so $227-MadeBy-Pat Reid could not be an edge.
The graph is directed and the arrow points from Person to Payment, so $91-MadeBy-Tia Hale could not be an edge.
The graph is directed and the arrow points from Person to Payment, so $268-MadeBy-Del Hall could not be an edge.
Jan West-Makes-$142 is a connection between two vertices, so could be an edge.

Expected:
Rob Ross-Teaches-English
Ina West-Teaches-English
Rob Ross-Teaches-English is a connection between two vertices, so could be an edge.
Course is an Entity type, so a vertex label, not an edge.
Instructor is an Entity type, so a vertex label, not an edge.
Ina West-Teaches-English is a connection between two vertices, so could be an edge.
The graph is directed and the arrow points from Instructor to Course, so Databases-TaughtBy-Zoe Rios could not be an edge.
Teaches is a Relationship type, so an edge label.

Expected:
NumberOfTerminals: 5
FlightNumber: 3572
MealPreference is a property name without a value, so not a property.
5 is a property value without a name, so not a property.
FlightNumber is a property name without a value, so not a property. NumberOfTerminals: 5 is a name-value pair, so a property.
Rob Wood is a property value without a name, so not a property.
Default is a property value without a name, so not a property. FlightNumber: 3572 is a name-value pair, so a property.

Expected:
PhoneNumber: (171) 736-1461
Gate: 29
DateOfBirth is a property name without a value, so not a property.
10 is a property value without a name, so not a property.
First is a property value without a name, so not a property. PhoneNumber: (171) 736-1461 is a name-value pair, so a property.
Duration is a property name without a value, so not a property. ArrivalDateTime is a property name without a value, so not a property. Gate: 29 is a name-value pair, so a property.

Expected:
g.addV('Flight').property('FlightNumber', '3416').property('AirlineName', 'Delta')
g.V('Gus King').addE('Books').to(g.V('3416'))
g.V('Gus King').out('Books')
g.addV('Flight') adds a vertex with label 'Flight' to graph g.
property('FlightNumber', '3416') adds name-value pair FlightNumber: 3416 to the new vertex.
property('AirlineName', 'Delta') adds name-value pair AirlineName: Delta to the new vertex.
g.V('Gus King').addE('Books').to(g.V('3416')) adds an edge labeled 'Books' from vertex Gus King to vertex 3416.
g.V('Gus King').out('Books') traverses the 'Books' edge from Gus King and returns the Flight vertex at its end, with properties FlightNumber: 3416 and AirlineName: Delta.
- A single student is represented as a document with field:value pairs. The name field is assigned a BSON string, gpa is a double, and interests is an array.
- Documents may be nested. The student document contains a nested address document.

- MongoDB organizes documents into collections. A group of students is stored in a single collection.
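The structure described above can be pictured with plain Python dictionaries standing in for BSON documents. This is a sketch, not MongoDB driver code: the field values, the second student, and the helper function find are hypothetical. Only the field names name, gpa, interests, and address come from the notes above.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]

# One student document: a string, a double, an array, and a nested document.
student: Document = {
    "name": "Ana Diaz",                      # string
    "gpa": 3.7,                              # double
    "interests": ["databases", "chess"],     # array
    "address": {"city": "Austin", "state": "TX"},  # nested document
}

# A collection groups documents; documents need not share every field.
students: List[Document] = [
    student,
    {"name": "Lee Park", "gpa": 3.2, "interests": []},
]


def find(collection: List[Document], **criteria: Any) -> List[Document]:
    # Select documents that contain every named field with the given value.
    return [
        doc for doc in collection
        if all(field in doc and doc[field] == value
               for field, value in criteria.items())
    ]


print(find(students, gpa=3.7))
print(find(students, Credits=12))  # no document has a Credits field
```

A query on a field that no document contains selects no documents, rather than raising an error.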






