CIS275 - Chapter 9: NoSQL Databases Flashcards
Structured data created within an organization, with sizes ranging from gigabytes to terabytes, is called _____.
transactional data
Data generated by new internet and multimedia applications is commonly called ____ and differs from transactional data.
big data
big data and differs from transactional data in four ways:
Volume. Typical size ranges from terabytes to petabytes (million billion bytes), occasionally reaching exabytes (billion billion bytes).
Velocity. Big data is generated at extremely high rates. Facebook users upload roughly a billion photos per day, or 10,000 per second. Twitter generates roughly 6,000 tweets per second. Click rates on popular websites can be significantly higher.
Variety. Variety means both unstructured and rapidly changing data types. Unstructured data refers to information embedded in complex data types like images, video, GPS coordinates, and natural language. Rapidly changing data means the information content of records vary greatly, as in data collected from social media. Both unstructured and rapidly changing data are common in big data.
Veracity. Transactional data is typically created by an organization’s employees or trusted partners. Big data is often generated by the general public. Consequently, the accuracy of big data varies much more than transactional data.
- Variety refers to unstructured data, such as text files, video, web logs, social media, and sensor data.
- Variety also refers to variable data structures. Ex: Facebook, LinkedIn, and Twitter contain different information about people, which might be combined in big data.
_____ increases capacity by increasing speed and size of CPUs and storage devices for a limited number of machines.
Vertical scaling
To accommodate increasing database sizes, transactional applications commonly scale vertically, not horizontally.
Vertical scaling increases processing speed, memory, and storage of a limited number of machines.
_____ increases capacity by adding large numbers of low-cost components like standard disk drives and CPUs.
Horizontal scaling
To accommodate increasing database sizes, transactional applications commonly scale vertically, not horizontally.
Horizontal scaling adds an unlimited number of machines working in parallel.
_____ splits large tables into separate physical files on one machine.
Partitioning
Relational databases were developed prior to big data. Historically, which of the following requirements were prioritized by relational databases?
_____ splits data sets across multiple machines.
Sharding
Data is represented as a single key and an associated value. The key is used to access the value.
Key-value database
Key-value systems support a limited set of queries, such as:
put(key, value) - Stores the value in the database, indexed by key.
get(key) - Retrieves the value associated with the key.
multiGet(key1, …, keyn) - Retrieves the values associated with keys 1 through n.
delete(key) - Deletes the value associated with the key.
also called column-based, column family, or tabular database. Data is represented as a key and multiple values. Since each record has multiple values, a descriptive name is stored with each value.
Wide column database
Data is represented as a key and a ‘document’. Usually the document is in a structured, human-readable format such as XML or JSON.
Document database
Data is represented as a graph with nodes and edges.
Graph database
_____ supports the data models of several categories.
multi-model database, also called a hybrid database
Keys are used to identify and locate values. In this example, the key is an email address.
Values are photographs of the person associated with the email address.
Each key is associated with one value.
Key-value logical structure
The put() function stores a value in the database.
The get() function retrieves the value associated with a key.
Values are grouped in hash buckets.
Values are replicated on multiple devices for high availability and fast access.
Key-value physical structure
Updates to values are applied to one replica. For fast updates and high availability, additional replicas are not updated within a transaction.
If other replicas are accessed before an update is propagated, obsolete values are returned.
Eventually, the update is propagated to all replicas.
Expected: Webpage, User
The key must be unique. A webpage domain name and user email are unique, whereas two students may have the same age.
Expected:
Building,
Stock
The key must be unique. A building’s street address and stock symbol are unique, whereas two students may have the same age.
Expected:
Stock,
Employee
The key must be unique. A stock symbol and employee email are unique, whereas two patients may have the same full name.
Expected: Webpage
The key must be unique. A webpage domain name is unique, whereas two students may have the same age, or two employees may have the same title.
Expected:
‘/User/pics/flower3.jpg’ get(‘mike@email.com’) retrieves the value associated with the key ‘mike@email.com’, which is ‘/User/pics/flower3.jpg’.
Expected: New value added, Existing value replaced ‘joe@email.com’ is not a key, so put(‘joe@email.com’, ‘/User/pics/cat4.jpg’) adds a new key ‘joe@email.com’ with value ‘/User/pics/cat4.jpg’. ‘matt@email.com’ is already a key, so put(‘matt@email.com’, ‘/User/pics/puppy7.jpg’) updates the value ‘/User/pics/puppy7.jpg’ for key ‘matt@email.com’.
_____ databases store multiple versions of each value. Each version is marked with the date and time the version is created, called a timestamp.
timestamp
To access older values, the timestamp must be specified in a query. If a query does not specify a timestamp, the database selects the most recent version.
In a wide column database, a specific value is accessed with a combination of table name, key, column family name, column name, and optional timestamp.