session 6: data knowledge mgmt Flashcards
database
is a collection of related data files or tables that contain data
difficulties in managing data (10)
1) data increase exponentially with time
2) data are scattered throughout org.
3) multiple sources of data
4) data become outdated
5) data storage media degrade over time (data rot)
6) data security/quality/integrity may be compromised
7) new sources of data
8) legal requirements need to be met with appropriate data-storage methods
9) legacy IT systems/functional requirements may result in redundancy or inconsistency
10) high volumes of big data + the variety of data collected increase complexity
sources of data
internal sources: corporate database, company docs…
personal sources: personal thoughts, opinions…
external sources: commercial databases, gov. reports, corporate websites…
new sources: blogs, podcasts, tweets, etc.
clickstream data
data that visitors and customers produce when they visit a website and click on hyperlinks
Data governance (subset of IT governance)
an approach to managing info across an entire organization
data governance objective
enable available, transparent, useful data => single version of the truth
data governance involves…
provides a planned approach to data mgmt for all types of data
includes a formal set of business processes for data handling
requires well-defined, unambiguous rules => which address creating, collecting, handling, protecting data
master data mgmt
process that spans all of an organization’s business processes and applications
master data mgmt goal
goal: effectively store, maintain, exchange and synchronize master data
provide consistent, accurate, timely, up-to-date master data
master data def
set of core data such as customer, product, employee, vendor, etc.
stored in a master file or as tables as part of the database
transactional data def
generated and captured by operational systems; describes the business’s activities
represents activities or events (payroll cheques, customer invoices, etc.)
stored in transaction files or as tables in the database
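The split between master data and transactional data can be sketched with two tables. This is only an illustration using Python's built-in sqlite3 module; the table names (customer, invoice), columns, and sample rows are hypothetical and not taken from the course material.

```python
import sqlite3

# Hypothetical schema: "customer" holds master data, "invoice" holds transactional data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (          -- master data: one row per core entity
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE invoice (           -- transactional data: one row per business event
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL NOT NULL,
        issued_on   TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd')")
conn.executemany(
    "INSERT INTO invoice (customer_id, amount, issued_on) VALUES (?, ?, ?)",
    [(1, 120.0, "2024-01-05"), (1, 80.0, "2024-02-10")],
)

# Many transactions point back to the same master record.
rows = conn.execute("""
    SELECT c.name, COUNT(*), SUM(i.amount)
    FROM invoice AS i
    JOIN customer AS c ON i.customer_id = c.customer_id
    GROUP BY c.customer_id
""").fetchall()
print(rows)
```

The customer row changes rarely, while invoice rows keep accumulating as the business operates.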
big data def
collection of data that is so large and complex that it is difficult to manage using traditional database mgmt systems
characteristics of big data
exhibit variety
include unstructured/structured/ semi-structured data
generated at high velocity with an uncertain pattern
do not fit neatly into traditional, structured, relational databases
can be captured, processed, transformed and analyzed in a reasonable amount of time only by sophisticated information systems
sources of big data
traditional enterprise data (customer info, web store transactions…)
machine-generated/sensor data (smart meters, manufacturing sensors…)
social data (feedback comments…)
images captured by billions of devices
big data 3V
volume
velocity
variety
Issues with big data
can come from untrusted sources
big data is dirty (inaccurate, incomplete, incorrect, etc.)
big data changes, especially in data streams
data warehouses def
repository of historical data organized by subject to support decision makers
data mart def
low-cost, scaled-down version of a data warehouse designed for end-user needs in a strategic business unit
Query by example (QBE)
method of creating database queries that allows users to search for documents based on an example in the form of a selected string of text
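A rough sketch of the idea, assuming the "example" is a partly filled-in record and that blank fields mean "match anything". The function name query_by_example and the sample documents are hypothetical; real QBE tools build the query behind a form or grid instead of a Python function.

```python
def query_by_example(records, example):
    """Return records whose fields contain every non-empty value in `example`."""
    criteria = {k: v for k, v in example.items() if v not in (None, "")}
    return [
        r for r in records
        if all(
            str(v).lower() in str(r.get(k, "")).lower()
            for k, v in criteria.items()
        )
    ]

# Hypothetical document index.
documents = [
    {"title": "Q1 sales report", "author": "Lee"},
    {"title": "Data governance policy", "author": "Kim"},
    {"title": "Q2 sales report", "author": "Kim"},
]

# The user fills in only the part they know; the blank author field is ignored.
matches = query_by_example(documents, {"title": "sales", "author": ""})
print([d["title"] for d in matches])
```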
characteristics of data warehouses and data marts (6)
organized by business dimension or subject
use online analytical processing
integrated
time variant
nonvolatile
multidimensional
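"Organized by business dimension" and "multidimensional" can be illustrated by summing the same facts along different dimensions. The sketch below uses plain Python; the sales facts, the dimension names, and the roll_up helper are all made up for illustration, and real OLAP tools do this at far larger scale.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, region, quarter, revenue)
facts = [
    ("laptop", "east", "Q1", 100),
    ("laptop", "west", "Q1", 150),
    ("phone", "east", "Q1", 200),
    ("laptop", "east", "Q2", 120),
]
DIMENSIONS = ("product", "region", "quarter")

def roll_up(facts, by):
    """Sum revenue grouped by the chosen dimensions."""
    idx = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[3]  # revenue is the last column
    return dict(totals)

by_product = roll_up(facts, ["product"])
by_region_quarter = roll_up(facts, ["region", "quarter"])
print(by_product)
print(by_region_quarter)
```

The same fact rows answer different questions depending on which dimensions are chosen for grouping.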
ETL
extract, transform, load
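A minimal sketch of the three ETL steps, assuming a small CSV extract as the source and an in-memory SQLite table as the target. The sample data, the orders table, and the cleaning rules (trim names, drop rows with no amount) are assumptions chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from an operational system.
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,Carol,7.25
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount, tidy names, convert types."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue  # incomplete record, skipped in this sketch
        clean.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(row["amount"]))
        )
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```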
generic data warehouse environment
source systems : provide data to the warehouse or mart
data integration technology and processes: prepare data for use
storing data: handled by variety of architectures
metadata: data about data
data quality: data cleansing must be used to ensure the data meet users’ needs
BI: establishing people, committees, and processes to maintain data warehouses
users: business value for users rises when data can be accessed quickly
data lakes def
central repository that stores all of an organization’s data, regardless of its source or format
information silo def
an info system that does not communicate with other related info systems in an organization
how companies can use big data to gain a competitive advantage
strategies:
make big data available
use big data to conduct experiments
micro-segmentation of customers
creating new business models
use in functional areas of org. :
human resources (employee benefits, hiring…)
product development (capture customer preferences…)
operations (analyze data to make operations more efficient)
marketing (better understanding customers…)
gov operations
architectures for data mart and data warehouses
one central enterprise data warehouse (without data marts)
independent data marts: each data mart stores data for a single application or a few applications
hub and spoke: contains a central data warehouse that stores the data plus multiple dependent data marts that source their data from the central repository
benefits of data warehousing
- end users can access needed data quickly and easily through web browsers because these data are located in one place
- end users can conduct extensive analysis with data in ways that were not previously possible
- end users can obtain consolidated view of organizational data
data warehouse and data lakes differences
data:
warehouse:
relational data from transactional systems, operational databases and line-of-business applications
lakes:
non-relational and relational from IoT devices, websites, mobile apps, social media and corporate applications
schema:
warehouse:
designed prior to the DW implementation (schema-on-write)
lake:
written at the time of analysis (schema-on-read)
price/performance:
warehouse:
fastest query results using higher cost storage
lake:
query results getting faster using low-cost storage
data quality:
warehouse:
highly curated data that serves as the central version of the truth
lakes:
any data that may or may not be curated
users:
warehouse:
business analysts
lakes:
data scientists, data developers, business analysts
analytics:
warehouse: batch reporting, BI and visualizations
lake: machine learning, predictive analytics, data discovery and profiling
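The schema-on-write versus schema-on-read row in the comparison above can be sketched as follows. The warehouse side is modelled with a SQLite table whose structure is fixed before loading; the lake side is modelled as a list of raw JSON strings that only get a structure when they are read. The sensor records, the reading table, and the read_temperatures helper are hypothetical.

```python
import json
import sqlite3

# Schema-on-write: the structure is designed before any data is stored.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE reading (device TEXT NOT NULL, temp_c REAL NOT NULL)")
warehouse.execute("INSERT INTO reading VALUES (?, ?)", ("sensor-1", 21.5))

try:
    # A record that does not fit the schema is rejected at write time.
    warehouse.execute("INSERT INTO reading VALUES (?, ?)", ("sensor-2", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Schema-on-read: raw records are stored as-is, whatever their shape.
lake = [
    '{"device": "sensor-1", "temp_c": 21.5}',
    '{"device": "sensor-2", "humidity": 40}',
    '{"device": "sensor-3", "temp_c": "19.0", "note": "recalibrated"}',
]

def read_temperatures(raw_records):
    """Apply a schema at analysis time: keep only records with a temperature."""
    out = []
    for line in raw_records:
        rec = json.loads(line)
        if "temp_c" in rec:
            out.append((rec["device"], float(rec["temp_c"])))
    return out

temps = read_temperatures(lake)
print(rejected, temps)
```

The lake accepts every record but pushes the interpretation work onto whoever analyzes it, which fits the "may or may not be curated" row of the comparison.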