Database and Information Management Flashcards
Effective Information Systems
Provide users with information that is accurate, timely and relevant
Accurate: free of errors
Timely: available when decision makers need it
Relevant: useful and appropriate for the types of work and decisions that require it
Bit
the smallest unit of data a computer can handle
Byte
a group of bits. Represents a single character, which can be a letter, a number or another symbol
Field
a grouping of characters into a word, a group of words, or a complete number (e.g., a person's name or age)
Record
a group of related fields, such as a student’s name, the course taken, the date and the grade.
File
a group of records of the same type
Entity
a person, place, thing or event on which we store and maintain information. (a record describes an entity)
Attribute
each characteristic or quality describing a particular entity.
Data Redundancy and Inconsistency (problems with traditional file processing)
Data Redundancy: duplicate data in multiple files
Data Inconsistency: the same attribute has different values because it’s only updated in some systems but not others – or the same attribute has different names (e.g. Student_ID vs just ID)
Difficult to implement CRM, SCM and ERP systems that integrate data from different sources
Program-Data Dependence (problems with traditional file processing)
coupling of data stored in files and the specific programs required to update and maintain those files – changes to the programs require changes to the data
e.g. if a new software program requires nine-digit ZIP codes instead of five-digit ZIP codes, the data must be changed to match – and other programs that still expect five-digit ZIP codes will then no longer work properly
Lack of Flexibility (problems with traditional file processing)
Traditional file systems can deliver scheduled routine reports after extensive programming efforts, but they cannot deliver ad hoc reports or respond to unanticipated information requirements in a timely fashion
Poor Security (problems with traditional file processing)
Little control over or management of data – access to and dissemination of information are largely unmanaged, so management might not know who has access to, or who makes changes to, the organization’s data
Lack of Data Sharing and Availability (problems with traditional file processing)
Information sits in different files in different departments, so it is difficult to share and access it in a timely manner – and users can find different values for the same piece of information in two different systems, which creates distrust in both systems (data redundancy and inconsistency)
Database
A collection of data organized to serve many applications efficiently by centralizing the data and controlling redundant data
E.g. a human resources database with multiple views – one database instead of separate files for personnel, payroll, and benefits, with that information extracted as multiple views for different uses
Database Management Systems (DBMS)
Software that enables an organization to centralize data, manage it efficiently, and provide access to the stored data by application programs
The DBMS provides an interface between application programs and physical data files, i.e. when an application program calls for a data item, the DBMS finds the item and presents it to the application program
Separates logical and physical view of data
Logical view of data
Presents data as it would be perceived by end users
Physical view of data
How data is actually organized and structured on physical storage media
How does DBMS solve problems of the traditional file environment?
Reduces data redundancy and inconsistency by minimizing isolated files in which the same data is repeated (even if some redundancy remains, the DBMS can eliminate inconsistency by ensuring the repeated data has the same value)
Easier data sharing because the data is in one single location
Relational DBMS
Most popular type of DBMS
Presents data as relations (two-dimensional tables), where each table contains data on an entity and its attributes
Fields in a relational database are also called columns, and rows are called records
Key field: the unique identifier for all the information in any row of the table
Primary key: unique key, cannot be duplicated
Foreign key: a lookup field used to find related data; it is the primary key of another table
Three basic operations of a relational DBMS:
Select: creates a subset with the records that meet stated criteria
Join: combines relational tables
Project: creates subset consisting of columns in a table, allows the user to create new tables with only the info required
Key field
the unique identifier for all the information in any row of the table
Primary key
unique key, cannot be duplicated
Foreign key
a lookup field used to find related data; it is the primary key of another table
Three basic operations of a Relational DBMS
Select: creates a subset consisting of all records in the file that meet stated criteria
we want to select records (rows) from the PART table where the Part_Number equals 137 or 150.
Join: combines relational tables to provide the user with more information than is available in individual tables.
we want to join the now-shortened PART table (only parts 137 or 150 will be presented) and the SUPPLIER table into a single new table.
Project: creates a subset consisting of columns in a table, permitting the user to create new tables that contain only the information required
we want to extract from the new table only the following columns: Part_Number, Part_Name, Supplier_Number, and Supplier_Name
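A minimal sketch of the three operations expressed in SQL and run through Python's built-in sqlite3 module; the sample suppliers and parts below are invented for illustration.

```python
# Select, join, and project in SQL, run via Python's sqlite3 module.
# The sample rows are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SUPPLIER (Supplier_Number INTEGER PRIMARY KEY, Supplier_Name TEXT);
CREATE TABLE PART (Part_Number INTEGER PRIMARY KEY, Part_Name TEXT,
                   Supplier_Number INTEGER REFERENCES SUPPLIER(Supplier_Number));
INSERT INTO SUPPLIER VALUES (8259, 'CBM Inc.'), (8261, 'B. R. Molds');
INSERT INTO PART VALUES (137, 'Door latch', 8259),
                        (150, 'Door handle', 8261),
                        (152, 'Compressor', 8261);
""")

# SELECT:  only rows where Part_Number is 137 or 150 (the WHERE clause)
# JOIN:    combine PART with SUPPLIER on the shared Supplier_Number
# PROJECT: keep only the four columns the user actually needs
query = """
    SELECT p.Part_Number, p.Part_Name, s.Supplier_Number, s.Supplier_Name
    FROM PART p JOIN SUPPLIER s ON p.Supplier_Number = s.Supplier_Number
    WHERE p.Part_Number IN (137, 150)
"""
for row in conn.execute(query):
    print(row)
```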
Data definition (as a tool for organizing, managing, and accessing data in the database)
capability to specify the structure of the content of the database
used to create database tables to define the characteristics of the fields in each table
Data dictionary (as a tool for organizing, managing, and accessing data in the database)
where the information about the database is documented
automated or manual file that stores definitions of data elements and their characteristics
Data manipulation language (as a tool for organizing, managing, and accessing data in the database)
used to add, change, delete, and retrieve data in the database
extract data from the database to satisfy information requests
Querying and reporting (as a tool for organizing, managing, and accessing data in the database)
the most prominent data manipulation language is Structured Query Language (SQL)
Query: a request for data from a database
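A short sketch showing data definition, data manipulation, and a query together, again via sqlite3; the STUDENT table and its rows are hypothetical.

```python
# Data definition, data manipulation, and querying with SQL,
# run through Python's sqlite3 module. Table and rows are made up.
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition: specify the structure of the database content.
conn.execute("""CREATE TABLE STUDENT (
    Student_ID INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL,
    Course     TEXT,
    Grade      TEXT)""")

# Data manipulation: add, change, and delete data.
conn.execute("INSERT INTO STUDENT VALUES (101, 'Anna Lee', 'IS 101', 'B')")
conn.execute("INSERT INTO STUDENT VALUES (102, 'Ben Ode', 'IS 101', 'A')")
conn.execute("UPDATE STUDENT SET Grade = 'A' WHERE Student_ID = 101")
conn.execute("DELETE FROM STUDENT WHERE Student_ID = 102")

# Querying: a request for data from the database.
for row in conn.execute("SELECT Student_ID, Name, Grade FROM STUDENT"):
    print(row)
```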
Conceptual design and physical design (database design)
Conceptual/logical design: an abstract model of the database from a business perspective
Physical design: shows how the database is arranged on direct-access storage devices
Database Normalization (database design)
Structuring a database to reduce redundancy and inconsistency by splitting it into separate logical groups that can be connected with primary/foreign keys, avoiding duplicates and data inconsistency
The process of creating small, stable, yet flexible and adaptive data structures from complex groups of data
See model
We split up the different logical groups (of the ORDER before normalization) into separate tables, e.g. parts, suppliers, orders, and line items: each order may include multiple parts, so we store each order number with multiple parts in the LINE_ITEM table, linked by foreign keys – see the sketch below
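A sketch of that normalized design in SQL (via sqlite3): the wide pre-normalization ORDER record is split into SUPPLIER, PART, ORDER, and LINE_ITEM tables connected by primary/foreign keys. The exact column names are assumptions.

```python
# Normalized schema: one table per logical group, linked by keys.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SUPPLIER (
    Supplier_Number INTEGER PRIMARY KEY,
    Supplier_Name   TEXT);
CREATE TABLE PART (
    Part_Number     INTEGER PRIMARY KEY,
    Part_Name       TEXT,
    Supplier_Number INTEGER REFERENCES SUPPLIER(Supplier_Number));
CREATE TABLE "ORDER" (
    Order_Number    INTEGER PRIMARY KEY,
    Order_Date      TEXT);
CREATE TABLE LINE_ITEM (
    Order_Number    INTEGER REFERENCES "ORDER"(Order_Number),
    Part_Number     INTEGER REFERENCES PART(Part_Number),
    Quantity        INTEGER,
    PRIMARY KEY (Order_Number, Part_Number));
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print("Normalized tables:", tables)
```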
Referential integrity rules
To ensure relationships between coupled tables remain consistent, relational database systems try to enforce referential integrity rules
For example: when a table has a foreign key that points to another table, you may not add a record to that table unless there is a corresponding record in the linked table (see the sketch below)
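A tiny demonstration of a referential integrity rule being enforced, using sqlite3 (which only checks foreign keys once PRAGMA foreign_keys = ON is issued); the tables and rows are hypothetical.

```python
# Referential integrity: an orphan foreign-key value is rejected.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # sqlite3 enforces FKs only after this
conn.executescript("""
CREATE TABLE SUPPLIER (Supplier_Number INTEGER PRIMARY KEY, Supplier_Name TEXT);
CREATE TABLE PART (Part_Number INTEGER PRIMARY KEY, Part_Name TEXT,
                   Supplier_Number INTEGER,
                   FOREIGN KEY (Supplier_Number)
                       REFERENCES SUPPLIER(Supplier_Number));
""")
conn.execute("INSERT INTO SUPPLIER VALUES (8259, 'CBM Inc.')")
conn.execute("INSERT INTO PART VALUES (137, 'Door latch', 8259)")   # OK

try:
    # Fails: supplier 9999 has no corresponding record in SUPPLIER.
    conn.execute("INSERT INTO PART VALUES (152, 'Compressor', 9999)")
except sqlite3.IntegrityError as err:
    print("Rejected by referential integrity rule:", err)
```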
Entity-relationship diagram (database design)
How database designers document their data model
See model:
SUPPLIER ||—|< PART ||—|< LINE_ITEM >|—|| ORDER
(relationship labels: provides / is supplied by; is ordered / contains; belongs to / includes)
the relationship between the entities SUPPLIER, PART, LINE_ITEM, and ORDER. The boxes represent entities. The lines connecting the boxes represent relationships. A line connecting two entities that ends in two short marks designates a one-to-one relationship. A line connecting two entities that ends with a crow’s foot topped by a short mark indicates a one-to-many relationship
One ORDER can contain many LINE_ITEMs. (A PART can be ordered many times and appear many times as a line item in a single order.) Each PART can have only one SUPPLIER, but many PARTs can be provided by the same SUPPLIER.
Nonrelational Database Management System
Uses a more flexible data model and is designed for large data sets spread across many distributed machines and for easily scaling up or down
Useful for accelerating simple queries against large volumes of structured and unstructured data, including web, social media, graphics and other forms of data that are difficult to analyze with traditional SQL tools
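A toy illustration of the flexible, schema-less document model many nonrelational stores use; plain Python dictionaries stand in for documents here, and no particular product's API is implied.

```python
# Documents in the same collection need not share a fixed set of fields.
customers = [
    {"id": 1, "name": "Anna Lee", "email": "anna@example.com"},
    {"id": 2, "name": "Ben Ode", "twitter": "@benode",
     "recent_posts": ["Loving the new phone!", "Battery could be better"]},
]

# A simple "query" across documents with different fields: who has social data?
with_social = [c["name"] for c in customers if "twitter" in c or "recent_posts" in c]
print(with_social)   # ['Ben Ode']
```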
Cloud Databases
Lower cost than in-house database products
Cloud computing vendors provide relational database engines, such as Amazon Relational Database Service (RDS), which offers MySQL, Oracle Database, or Amazon Aurora database engines
Distributed Databases
One that is stored in multiple physical locations – some parts or copies are physically stored in one location, whereas other parts or copies are maintained in other locations
Blockchain
Distributed database technology that enables firms and organizations to verify transactions on a network nearly instantaneously without a central authority
Highly secure: records are protected through strong encryption (cryptography)
The blockchain maintains a continuously growing list of records called blocks. Each block contains a timestamp and a link to a previous block, and once a block of data is recorded on the blockchain ledger, it cannot be altered retroactively. When someone wants to add a transaction, participants in the network (all of whom have copies of the existing blockchain) run algorithms to evaluate and verify the proposed transaction. Legitimate changes to the ledger are recorded across the blockchain in a matter of seconds or minutes and records are protected through cryptography
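A heavily simplified, single-machine sketch of the hash-chained ledger idea (each block stores a timestamp, its data, and a link to the previous block); real blockchains add network-wide verification by participants, consensus, and much more.

```python
# Toy hash chain: tampering with an earlier block breaks the later links.
import hashlib
import json
import time

def block_hash(block):
    core = {k: block[k] for k in ("timestamp", "data", "prev_hash")}
    return hashlib.sha256(json.dumps(core, sort_keys=True).encode()).hexdigest()

def make_block(data, prev_hash):
    block = {"timestamp": time.time(), "data": data, "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    return block

def chain_is_valid(chain):
    return all(cur["prev_hash"] == prev["hash"] and cur["hash"] == block_hash(cur)
               for prev, cur in zip(chain, chain[1:]))

ledger = [make_block("genesis", "0")]
ledger.append(make_block({"from": "A", "to": "B", "amount": 10}, ledger[-1]["hash"]))
ledger.append(make_block({"from": "B", "to": "C", "amount": 4}, ledger[-1]["hash"]))
print(chain_is_valid(ledger))        # True

ledger[1]["data"]["amount"] = 1000   # retroactive tampering...
print(chain_is_valid(ledger))        # ...is detected: False
```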
Big Data
Data sets with volumes so huge that they are beyond the ability of a typical DBMS to capture, store, and analyze
Extreme VOLUME of data
Wide VARIETY of data
VELOCITY at which data must be processed
No specific quantity defines big data, but it usually involves data in the petabyte and exabyte range (billions to trillions of records, many from different sources)
Business Intelligence Infrastructure
Data Warehouse: database that stores current and historical data of potential interest to decision makers throughout the company
Data Marts: a subset of a data warehouse in which a summarized or highly focused portion of the organization’s data is placed in a separate database for a specific population of users –> like a decentralized warehouse
Data Lake: handles unstructured (e.g. video data), semi structured (e.g. Twitter feeds) and structured data (e.g. transactional data)
Hadoop: open source software framework (managed by the Apache Software Foundation), which enables distributed parallel processing of huge amounts of data across inexpensive computers
In-Memory Computing: relies primarily on a computer’s main memory (RAM) for data storage; the data users access is held in primary memory, which shortens query response times
Analytic Platforms: Commercial database vendors have developed specialized high-speed analytic platforms using both relational and nonrelational technology that are optimized for analyzing large data sets
Data Warehouse
database that stores current and historical data of potential interest to decision makers throughout the company
consolidates and standardizes information for use across the enterprise, but the data cannot be altered
Data Marts
a subset of a data warehouse in which a summarized or highly focused portion of the organization’s data is placed in a separate database for a specific population of users –> like a decentralized warehouse
Hadoop Key Services (3 services)
Hadoop Distributed File System (HDFS): for data storage, it links together the file systems on the numerous nodes in a Hadoop cluster to turn them into one big file system
MapReduce: for high performance parallel data processing, breaks down processing of huge data sets and assigns work to the various nodes in a cluster
HBase: Hadoop’s nonrelational database; provides rapid access to the data stored on HDFS and a transactional platform for running high-scale, real-time applications
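A pure-Python sketch of the MapReduce programming model (word counting); actual Hadoop distributes the map and reduce work across the nodes of a cluster, which this single-machine toy does not do.

```python
# map -> shuffle/group -> reduce, on one machine, for illustration only.
from collections import defaultdict

documents = ["big data big insight", "big data everywhere"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values by key (in Hadoop this moves data between nodes).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine the values for each key into a final result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'big': 3, 'data': 2, 'insight': 1, 'everywhere': 1}
```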
Online Analytical Processing (OLAP)
Support multidimensional data analysis, enabling users to view the same data in different ways using multiple dimensions
Requires you to already know what you want from the data (which question you want answered), but users can answer such ad hoc questions quickly: e.g. a product manager could use a multidimensional data analysis tool to learn how many washers were sold in the East in June, how that compares with the previous month and the previous June, and how it compares with the sales forecast
See model
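A rough sketch of one "slice" of a multidimensional analysis using sqlite3, echoing the washers-in-the-East example; the SALES table and its figures are invented.

```python
# Slice the sales data by product, region, and month.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALES (product TEXT, region TEXT, month TEXT, units INTEGER)")
conn.executemany("INSERT INTO SALES VALUES (?, ?, ?, ?)", [
    ("Washer", "East", "May", 120), ("Washer", "East", "June", 150),
    ("Washer", "West", "June", 90),  ("Dryer",  "East", "June", 75),
])

# One view of the cube: washers sold in the East, by month.
for row in conn.execute("""
    SELECT month, SUM(units)
    FROM SALES
    WHERE product = 'Washer' AND region = 'East'
    GROUP BY month"""):
    print(row)
```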
Data Pipelines (analytical application)
A series of data processing steps that data engineers build to make sure users can access the specific information they need
Data mining
Provides insight into corporate data that cannot be obtained with OLAP by finding hidden patterns and relationships in large databases and inferring rules from them to predict future behavior
Associations (information available from data mining)
Occurrences linked to a single event
E.g. study of supermarket purchasing patterns might reveal that, when corn chips are purchased, a cola drink is purchased 65 percent of the time, but when there is a promotion, cola is purchased 85 percent of the time
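A toy calculation of the confidence of that kind of association rule ("when corn chips are purchased, cola is purchased X percent of the time") over made-up baskets.

```python
# Association rule confidence: P(cola in basket | corn chips in basket).
baskets = [
    {"corn chips", "cola", "salsa"},
    {"corn chips", "cola"},
    {"corn chips", "bread"},
    {"milk", "cola"},
    {"corn chips", "cola", "beer"},
]

with_chips = [b for b in baskets if "corn chips" in b]
with_chips_and_cola = [b for b in with_chips if "cola" in b]

confidence = len(with_chips_and_cola) / len(with_chips)
print(f"When corn chips are purchased, cola is also purchased "
      f"{confidence:.0%} of the time")   # 75% for this sample
```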
Sequences (information available from data mining)
Events are linked over time
if a house is purchased, a new refrigerator will be purchased within two weeks 65 percent of the time, and an oven will be bought within one month of the home purchase 45 percent of the time
Classifications (information available from data mining)
recognize patterns that describe the group to which an item belongs by examining existing items that have been classified and by inferring a set of rules
businesses such as credit card or telephone companies worry about the loss of steady customers. Classification helps discover the characteristics of customers who are likely to leave and can provide a model to help managers predict who those customers are so that the managers can devise special campaigns to retain such customers
Clusters (information available from data mining)
works like classification, but is used when no groups have yet been defined
A data mining tool can discover different groupings within data, such as finding affinity groups for bank cards or partitioning a database into groups of customers based on demographics and types of personal investments.
Forecasts (information available from data mining)
Uses a series of existing values to forecast what other values will be
forecasting might find patterns in data to help managers estimate the future value of continuous variables, such as sales figures
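A minimal sketch of trend-based forecasting: fit a least-squares line to past monthly sales and extrapolate one month ahead. The sales figures are invented, and real forecasting tools use far richer models.

```python
# Fit y = slope*x + intercept to past sales and project the next month.
sales = [100, 110, 123, 135, 141, 155]        # units sold in the past six months
xs = list(range(len(sales)))

x_mean = sum(xs) / len(xs)
y_mean = sum(sales) / len(sales)
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, sales))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

forecast = slope * len(sales) + intercept      # value for the next month
print(f"Forecast for month {len(sales) + 1}: {forecast:.0f} units")
```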
Text mining tools
help businesses analyze unstructured data by extracting key elements from unstructured natural language text, discovering patterns and relationships, and summarizing the information
Sentiment analysis software (web mining tool)
mines text comments in email messages, blogs, social media conversations, or survey forms to detect favorable and unfavorable opinions about specific subjects
E.g. a company can use sentiment analysis to tune in to consumer conversations about its products across social networks, blogs, and other websites
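A toy, keyword-lexicon sketch of sentiment scoring; commercial sentiment analysis software relies on much larger lexicons and machine learning, and the word lists and comments here are only illustrative.

```python
# Count positive vs. negative words to label each comment.
POSITIVE = {"love", "great", "excellent", "happy"}
NEGATIVE = {"hate", "terrible", "broken", "disappointed"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "favorable" if score > 0 else "unfavorable" if score < 0 else "neutral"

comments = [
    "I love the new model great battery",
    "Arrived broken and support was terrible",
]
for c in comments:
    print(sentiment(c), "-", c)
```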
Web mining (+ three types: content-, structure-, and usage mining)
The discovery and analysis of useful patterns and information from the World Wide Web
Content mining: process of extracting knowledge from the content of web pages - includes text, image, audio and video data
Structure mining: examines data related to the structure of a particular website (E.g. links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness or perhaps the variety of topics covered in the document)
Usage mining: examines user interaction data recorded by a web server whenever requests for a website’s resources are received
Linking internal databases to the web and the advantages of using web access to internal databases
Users access an organization’s internal database through the web using their desktop PC browsers or mobile apps
See model
The web browser software is easier to use than proprietary query tools
The web interface requires few or no changes to the internal database
It costs much less to add a web interface in front of a legacy system than to redesign and rebuild the system to improve user access.
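A bare-bones sketch of putting a web interface in front of an internal database, using only the Python standard library (http.server plus sqlite3); the table, port, and data are hypothetical, and a real deployment would add authentication, input validation, and a proper web framework.

```python
# Minimal web front end: a browser request triggers a query on the database.
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the internal database (check_same_thread=False is defensive).
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE PART (Part_Number INTEGER, Part_Name TEXT)")
db.executemany("INSERT INTO PART VALUES (?, ?)",
               [(137, "Door latch"), (150, "Door handle")])

class PartHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        rows = db.execute("SELECT Part_Number, Part_Name FROM PART").fetchall()
        body = json.dumps([{"number": n, "name": name} for n, name in rows])
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    # Browse to http://localhost:8080/ to query the database over the web.
    HTTPServer(("localhost", 8080), PartHandler).serve_forever()
```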
Data Governance (Information Policy)
Deals with policies and processes for managing availability, usability, integrity, and security of data, especially regarding government regulations
(Encompasses policies and procedures to manage data – the rules for sharing, disseminating, acquiring, standardizing, classifying and inventorying information)
Database administration (Information Policy)
Creating and maintaining databases
Data administration (Information Policy)
establishes policies and procedures to manage data
Information Policy
Rules, procedures, roles for sharing, managing, standardizing data
Data administration: establishes policies and procedures to manage data
Data Governance: Deals with policies and processes for managing availability, usability, integrity, and security of data, especially regarding government regulations
Database administration: Creating and maintaining databases
Data quality audit
A structured survey of the accuracy and level of completeness of the data in an information system
Data cleaning / data scrubbing
Activities for detecting and correcting data in a database that are incorrect, incomplete, improperly formatted, or redundant
Data cleansing software can automatically survey data files, correct errors, and integrate the data in a consistent, companywide format
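A small sketch of data cleansing/scrubbing steps – standardizing formats and removing redundant records; the customer records and the rules (e.g. 5-digit ZIP codes) are illustrative assumptions.

```python
# Detect and fix incomplete, improperly formatted, and redundant records.
records = [
    {"name": "  Anna Lee ", "zip": "2138"},
    {"name": "Ben Ode",     "zip": "02139-1234"},
    {"name": "anna lee",    "zip": "02138"},      # duplicate of the first row
]

def clean(record):
    name = " ".join(record["name"].split()).title()   # trim and standardize case
    zip5 = record["zip"].split("-")[0].zfill(5)        # keep 5 digits, pad if short
    return {"name": name, "zip": zip5}

cleaned, seen = [], set()
for rec in map(clean, records):
    key = (rec["name"], rec["zip"])
    if key not in seen:                                # drop redundant duplicates
        seen.add(key)
        cleaned.append(rec)

print(cleaned)
```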
Structured Query Language (SQL)
Created to manipulate databases
Standard way of interacting with relational databases (RDBs)