Explore Core Data Concepts Flashcards

1
Q

Why is data now accessible to nearly every business?

A

Data is easier to collect and cheaper to host. Software technologies and platforms can help facilitate the collection, analysis, and storage of valuable information

2
Q

What is data?

A

A collection of facts such as numbers, descriptions, and observations

3
Q

What is a common way of classifying data?

A

Structured, semi-structured, and unstructured

4
Q

What is structured data?

A

Tabular data represented by rows and columns with predefined data types in a database

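A minimal sketch (not from the source) of structured data using Python's built-in sqlite3 module: a table whose rows and columns have predefined data types. The table and column names are invented for illustration.

  import sqlite3

  # Hypothetical table: every column has a predefined type, and every row fits it.
  conn = sqlite3.connect(":memory:")
  conn.execute("""
      CREATE TABLE customers (
          id      INTEGER PRIMARY KEY,   -- numeric type
          name    TEXT NOT NULL,         -- string type
          balance REAL                   -- floating-point type
      )
  """)
  conn.execute("INSERT INTO customers (id, name, balance) VALUES (1, 'Ada', 99.50)")
  print(conn.execute("SELECT id, name, balance FROM customers").fetchall())
  # [(1, 'Ada', 99.5)]
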
5
Q

What is a relational database?

A

A database that holds tabular data

6
Q

What is semi-structured data?

A

Information that doesn’t reside in a relational database but still has some structure to it. Examples: JSON format documents, key-value stores, graph databases

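A minimal sketch (not from the source) of semi-structured data as JSON documents: the records share some structure, but fields can vary and nest, which wouldn't fit a rigid relational table. Field names are invented for illustration.

  import json

  # Two "customer" documents with some shared structure but not identical fields.
  docs = [
      {"id": 1, "name": "Ada"},
      {"id": 2, "name": "Grace", "email": "grace@example.com",
       "addresses": [{"city": "Arlington"}, {"city": "New York"}]},
  ]
  print(json.dumps(docs, indent=2))
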
7
Q

What is a key-value database?

A

Stores associative arrays. The key serves as a unique identifier used to retrieve a specific value; the value can be anything from a number or a string to a complex object such as a JSON document. Data is stored as a single collection without structure or relationships

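A minimal sketch (not from the source) of the key-value idea, using a plain Python dict as the associative array; real key-value databases behave the same way at this level. Keys and values here are invented for illustration.

  import json

  # The key is a unique identifier; the value can be a number, a string,
  # or a serialized complex object such as a JSON document.
  store = {}
  store["user:1001"] = json.dumps({"name": "Ada", "plan": "pro"})   # complex value
  store["page_hits"] = 42                                           # simple value

  # Retrieval is by key only; the store knows nothing about the value's structure.
  print(json.loads(store["user:1001"])["name"])   # Ada
  print(store["page_hits"])                       # 42
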
8
Q

What is a graph database?

A

Used to store and query information about complex relationships

Graph contains nodes (information about objects) and edges (information about the relationships between objects)

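A minimal sketch (not from the source) of the node/edge model using plain Python structures; real graph databases add their own storage and query languages, so this only illustrates the shape of the data.

  # Nodes hold information about objects; edges hold information about
  # the relationships between those objects.
  nodes = {
      "alice":   {"type": "Person"},
      "bob":     {"type": "Person"},
      "contoso": {"type": "Company"},
  }
  edges = [
      ("alice", "WORKS_FOR", "contoso"),
      ("bob",   "WORKS_FOR", "contoso"),
      ("alice", "MANAGES",   "bob"),
  ]

  # A relationship query: who works for contoso?
  print([src for src, rel, dst in edges if rel == "WORKS_FOR" and dst == "contoso"])
  # ['alice', 'bob']
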
9
Q

How is structured data typically stored?

A

Relational database such as SQL Server or Azure SQL Database

10
Q

How is unstructured data typically stored?

A

Azure Blob (Binary Large Object) storage

11
Q

How is semi-structured data typically stored?

A

Azure Cosmos DB

12
Q

What are the levels of access that users can be given to data?

A

Read-only - can view data but can’t modify existing data or create new data

Read/write - can view and modify data

Owner - can add new users and remove existing users

13
Q

What are the two broad data processing solutions?

A

Transaction processing

Analytical processing

14
Q

What is a transactional system?

A

Records transactions, which is the primary function of business computing

Examples include the movement of money between accounts in a banking system, or tracking payments for goods and services from customers in a retail system

A transaction is a small, discrete unit of work

Usually high volume, sometimes handling many millions of transactions in a single day

Often referred to as OLTP (Online Transactional Processing)

Data being processed has to be accessible very quickly

15
Q

How is fast processing supported in a transactional system?

A

Data is often divided into small pieces, i.e. each table involved in a transaction contains only the columns necessary to perform the transactional task, e.g.

for bank transfers, a table holding information about the funds in the account might only contain the account number and the current balance

Splitting tables out into separate groups of columns like this is called normalization

Normalization can enable a transactional system to cache much of the information required to perform transactions in memory and speed throughput

16
Q

What is a downside of transactional systems for querying?

A

Queries involving normalized tables will frequently need to join the data across several tables back together again.

This makes it difficult for business users who might need to examine the data.

17
Q

What is an analytical system?

A

Designed to support business users who need to query data and gain a big picture view of the information held in a database

Capture raw data and use it to generate insights

Most need to perform the following tasks:

  • data ingestion
  • data transformation
  • data querying
  • data visualization
18
Q

What is data ingestion?

A

Process of capturing raw data

Can be taken from control devices, point-of-sale devices, weather stations, recording of the movement of money between bank accounts, etc.

Can come from a separate OLTP system

Data needs to be stored in a repository which could be a file store, a document database, or even a relational database

19
Q

Why is data transformation/processing needed?

A

Raw data might not be in a format suitable for querying

Data might contain anomalies that should be filtered out or standardized

Data might need to be aggregated into KPIs (Key Performance Indicators). Key Performance Indicators are how businesses are measured for growth and performance

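A minimal sketch (not from the source) of the transformation step described above: filtering an anomaly out of raw records and aggregating the rest into a simple KPI. The field names and values are invented for illustration.

  # Raw point-of-sale records; the negative amount is an anomaly to filter out.
  raw_sales = [
      {"store": "A", "amount": 120.0},
      {"store": "A", "amount": -999.0},   # anomaly
      {"store": "B", "amount": 80.0},
      {"store": "B", "amount": 45.5},
  ]

  clean = [r for r in raw_sales if r["amount"] >= 0]   # filter anomalies

  # Aggregate to a KPI: total revenue per store.
  kpi = {}
  for r in clean:
      kpi[r["store"]] = kpi.get(r["store"], 0.0) + r["amount"]
  print(kpi)   # {'A': 120.0, 'B': 125.5}
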
20
Q

Why is data querying needed?

A

Most database management systems provide tools to enable you to perform ad-hoc queries against data and generate regular reports

May be looking for trends or attempting to determine cause of problems in your system

21
Q

Why is data visualization needed?

A

Data represented in tables isn’t always intuitive; visual representations such as charts can make trends and relationships easier to see

22
Q

Describe the benefits and problems of relational data, as well as solutions to known potential issues

A

Tables with rows and columns

The rigid structure can cause some problems, e.g. how to handle multiple addresses for one person - do you add four address columns and assume there can never be more than that?

Normalization will solve these types of problems

Primarily used to handle transaction processing

23
Q

What is normalization? What is the downside?

A

In relational data, data is split into a large number of narrow (few columns), well-defined tables with references from one table to another.

Querying data often requires reassembling data from multiple tables by joining the data back together at run-time. These types of queries can be expensive.

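A minimal sketch (not from the source) of normalization using Python's sqlite3: customers and addresses are split into two narrow tables that reference each other, and even a simple question has to reassemble them with a run-time JOIN. Table and column names are invented for illustration.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      -- Two narrow, well-defined tables instead of one wide one.
      CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
      CREATE TABLE addresses (customer_id INTEGER REFERENCES customers(id),
                              city TEXT);
      INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
      INSERT INTO addresses VALUES (1, 'London'), (2, 'New York'), (2, 'Arlington');
  """)

  # Reading "a customer plus their addresses" must join the tables back together.
  rows = conn.execute("""
      SELECT c.name, a.city
      FROM customers AS c JOIN addresses AS a ON a.customer_id = c.id
  """).fetchall()
  print(rows)   # [('Ada', 'London'), ('Grace', 'New York'), ('Grace', 'Arlington')]
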
24
Q

Describe a non-relational database as well as its positive and negative features

A

Stores data in a format that more closely matches the original structure, i.e. in a document database

So to retrieve the details of a customer, including the address, you just need to read a single document

But this means information can be duplicated (i.e. 2 customers share an address) and maintenance becomes more complex (2 documents must be updated when a couple changes address)

25
Q

What is a transaction?

A

A sequence of operations that are atomic.

This means all operations in the sequence must be completed successfully. If something goes wrong, all operations run so far in the sequence must be undone

Each transaction has a defined beginning point, followed by steps to modify the data within the database. At the end, the database either commits the changes to make them permanent or rolls back the changes to the starting point.

Bank transfers are a good example: you deduct funds from one account and credit the equivalent funds to another account. If the system fails after deducting the funds, they must be reinstated in the original account.

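A minimal sketch (not from the source) of the bank-transfer example as an atomic transaction, using Python's sqlite3: either both the debit and the credit are committed, or an error rolls both back and the accounts are left unchanged.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
      INSERT INTO accounts VALUES (1, 100.0), (2, 0.0);
  """)

  def transfer(amount, src, dst):
      try:
          with conn:   # begins a transaction; commits on success, rolls back on error
              conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                           (amount, src))
              if conn.execute("SELECT balance FROM accounts WHERE id = ?",
                              (src,)).fetchone()[0] < 0:
                  raise ValueError("insufficient funds")   # triggers the rollback
              conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                           (amount, dst))
      except ValueError as err:
          print("rolled back:", err)

  transfer(60, 1, 2)    # succeeds: balances become 40 and 60
  transfer(500, 1, 2)   # fails: both accounts are left unchanged
  print(conn.execute("SELECT id, balance FROM accounts").fetchall())
  # [(1, 40.0), (2, 60.0)]
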
26
Q

What is meant by acronym ACID?

A

A transactional database must adhere to the ACID (Atomicity, Consistency, Isolation, Durability) properties to ensure that the database remains consistent while processing transactions.

27
Q

What is Atomicity in ACID?

A

Atomicity guarantees that each transaction is treated as a single unit, which either succeeds completely or fails completely.

If any of the statements constituting a transaction fails to complete, the entire transaction fails and the database is left unchanged.

An atomic system must guarantee atomicity in each and every situation, including power failures, errors and crashes

28
Q

What is meant by Consistency in the ACID properties?

A

Ensures that a transaction can only take the data in the database from one valid state to another.

A consistent database should never lose or create data in a manner that can’t be accounted for.

For example, if you add funds to an account, there must be a corresponding deduction of funds somewhere, or a record that describes where the funds have come from if they have been received externally. You can’t suddenly create or lose money

29
Q

What is Isolation in the acronym ACID?

A

Ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially

A concurrent process can’t see the data in an inconsistent state (i.e. funds have been deducted from one account but not yet credited to another)

30
Q

What is Durability in the ACID acronym?

A

Guarantees that once a transaction has been committed it will remain committed even if there is a system failure such as a power outage or crash

31
Q

What makes database systems that process transactional workloads inherently complex?

A

Need to manage concurrent users possibly attempting to access and modify the same data at the same time, processing the transactions in isolation while keeping the database consistent and recoverable

32
Q

How do many transactional systems implement relational consistency and isolation? What are the downsides, if any?

A

Apply locks to data when it is updated.

The lock prevents another process from reading the data until the lock is released. The lock is released only when the transaction commits or rolls back.

Extensive locking can lead to poor performance while applications wait for locks to be released.

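A minimal sketch (not from the source) of lock-based isolation, using two SQLite connections to the same file: while the first transaction holds the write lock, a second writer waits and then gives up; committing the first transaction releases the lock. The file name is invented for illustration.

  import sqlite3

  db = "lock_demo.db"                                   # hypothetical on-disk database
  writer = sqlite3.connect(db, isolation_level=None)    # manage transactions explicitly
  writer.executescript("""
      CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance REAL);
      INSERT OR REPLACE INTO accounts VALUES (1, 100.0);
  """)

  writer.execute("BEGIN IMMEDIATE")                     # take the write lock
  writer.execute("UPDATE accounts SET balance = 50 WHERE id = 1")

  other = sqlite3.connect(db, timeout=1)                # a second, concurrent user
  try:
      other.execute("UPDATE accounts SET balance = 0 WHERE id = 1")
  except sqlite3.OperationalError as err:
      print("waited for the lock, then gave up:", err)  # 'database is locked'

  writer.execute("COMMIT")                              # lock released; others can proceed
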
33
Q

What is a distributed database?

A

A distributed database is a database in which data is stored across different physical locations. It may be held in multiple computers located in the same physical location (for example, a datacenter), or may be dispersed over a network of interconnected computers.

34
Q

What is a downside of a distributed database?

A

When compared to non-distributed database systems, any data update to a distributed database will take time to apply across multiple locations. If you require transactional consistency in this scenario, locks may be retained for a very long time, especially if there’s a network failure between databases at a critical point in time.

35
Q

How do many distributed database systems deal with the fact that locks can be retained for a very long time?

A

Relax the strict isolation requirements of transactions and implement “eventual consistency.”

36
Q

What is the concept of “eventual consistency”? What is the downside? Under what circumstances is eventual consistency ideal?

A

As an application writes data, each change is recorded by one server and then propagated to the other servers in the distributed database system asynchronously.

Can lead to temporary inconsistencies in the data.

Ideal where the application doesn’t require any ordering guarantees. Examples include counts of shares, likes, or non-threaded comments in a social media system.

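A minimal sketch (not from the source) of eventual consistency: a write is acknowledged by one server and propagated to a replica asynchronously, so a read from the replica can briefly return stale data. The "servers" here are just dictionaries, for illustration only.

  import threading, time

  primary = {"likes": 0}
  replica = {"likes": 0}

  def replicate_later(key, value, delay=0.5):
      """Propagate the change to the replica asynchronously, after a delay."""
      def apply():
          time.sleep(delay)
          replica[key] = value
      threading.Thread(target=apply).start()

  # A write lands on the primary and is acknowledged immediately.
  primary["likes"] = 1
  replicate_later("likes", 1)

  print(primary["likes"], replica["likes"])   # 1 0  -> temporarily inconsistent
  time.sleep(1)
  print(primary["likes"], replica["likes"])   # 1 1  -> eventually consistent
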
37
Q

What are analytical workloads?

A

Typically read-only systems that store vast volumes of historical data or business metrics, such as sales performance and inventory levels.

38
Q

How are analytics workloads used?

A

Analytical workloads are used for data analysis and decision making.

Analytics are generated by aggregating the facts presented by the raw data into summaries, trends, and other kinds of “Business information.”

Analytics can be based on a snapshot of the data at a given point in time, or a series of snapshots. Decision makers usually don’t require all the details of every transaction. They want the bigger picture.

39
Q

How are transactional and analytical workloads related?

A

Transactional information, however, is an integral part of analytical information. If you don’t have good records of daily sales, you can’t compile a useful report to identify trends.

40
Q

What is meant by streaming?

A

Processing data as it arrives

41
Q

What is meant by batch processing of data?

A

Buffering and processing the data in groups

42
Q

What are the advantages and disadvantages of batch processing?

A

Advantages of batch processing include:

  • Large volumes of data can be processed at a convenient time.
  • It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight, or during off-peak hours.

Disadvantages of batch processing include:

  • The time delay between ingesting the data and getting the results.
  • All of a batch job’s input data must be ready before a batch can be processed. This means data must be carefully checked. Problems with data, errors, and program crashes that occur during batch jobs bring the whole process to a halt. The input data must be carefully checked before the job can be run again. Even minor data errors, such as typographical errors in dates, can prevent a batch job from running.
43
Q

Give an example of when batch processing should be used

A

Moving data to a data analysis system where the data is not real-time.

44
Q

When should data streaming be used?

A

When new, dynamic data is generated on a continual basis

Ideal for time-critical operations that require an instant real-time response.

For example, a system that monitors a building for smoke and heat needs to trigger alarms and unlock doors to allow residents to escape immediately in the event of a fire.

45
Q

Describe the differences between batch processing and stream processing in terms of:

  • data scope
  • data size
  • performance
  • analysis
A

Data Scope:

  1. Batch processing can process all the data in the dataset.
  2. Stream processing typically only has access to the most recent data received, or within a rolling time window (the last 30 seconds, for example).

Data Size:

  1. Batch processing is suitable for handling large datasets efficiently.
  2. Stream processing is intended for individual records or micro batches consisting of few records.

Performance:

  1. The latency for batch processing is typically a few hours.
  2. Stream processing typically occurs immediately, with latency in the order of seconds or milliseconds. Latency is the time taken for the data to be received and processed.

Analysis:

  1. Typically use batch processing for performing complex analytics.
  2. Stream processing used for simple response functions, aggregates, or calculations such as rolling averages.
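
A minimal sketch (not from the source) contrasting the two modes on the same readings: the batch pass needs the whole dataset before it can produce its one result, while the streaming pass reacts to each record as it arrives using a small rolling window. The numbers and window size are invented for illustration.

  from collections import deque

  readings = [21.0, 21.5, 22.0, 29.5, 22.5]   # e.g. temperature samples

  # Batch: all data must be available first; one result at the end.
  print("batch average:", sum(readings) / len(readings))

  # Stream: process each record on arrival, over a rolling window of 3 readings.
  window = deque(maxlen=3)
  for value in readings:                      # imagine these arriving over time
      window.append(value)
      print(f"got {value}, rolling average: {sum(window) / len(window):.2f}")
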
46
Q

What does a Database Administrator do?

A

Manage databases, assign permissions to users, store backup copies of data and restore data in case of any failures

  1. Responsible for the design, implementation, maintenance, and operational aspects of on-premises and cloud-based database solutions
  2. Responsible for the overall availability and consistent performance and optimizations of the database solutions
  3. Work with stakeholders to implement policies, tools, and processes for backup and recovery plans to recover following a natural disaster or human-made error
  4. Responsible for managing the security of the data in the database, granting privileges over the data, granting or denying access to users as appropriate.
47
Q

What does a data engineer do?

A

Apply data cleaning routines, identify business rules, and turn data into useful information.

  1. Collaborate with stakeholders to design and implement data-related assets that include data ingestion pipelines, cleansing and transformation activities, and data stores for analytical workloads.
    • Use a wide range of data platform technologies, including relational and nonrelational databases, file stores, and data streams.
  2. Ensure privacy of data
  3. Manage and monitor data stores and pipelines to ensure data loads perform as expected.
48
Q

What does a Data Analyst Do?

A

Explore and analyze data to create visualizations and charts to enable organizations to make informed decisions.

  • Design and build scalable data models
  • Clean and transform data
  • Enabling advanced analytics capabilities through reports and visualizations.