all things data Flashcards
what is data?
Data is the raw and unprocessed facts that we capture according to some agreed-upon standards. Data could be a number, an image, an audio clip, a transcription, or similar.
what is information?
Information is data that has been processed, aggregated, and organized into a more human-friendly format. Data visualizations, reports and dashboards are common ways to present information. (facts revealed by data fitted with context)
what is insight?
Insight is gained by analyzing data and information in order to understand the context of a particular situation and draw conclusions. Those conclusions lead to actions you can apply to your business.
what is the goal of data management?
To enable an organization to get more value from its data. Successfully sharing, storing, protecting and retrieving data can be a competitive advantage. It helps to mitigate risks and enables decision making in organizations.
COSTS OF POOR DATA MANAGEMENT:
- Misinterpretation of data
- Lost data
- Inaccessible data
- Wasted time and money
- Missed deadlines
which data management activities are there?
- Governance activities
- Lifecycle activities
- Foundational activities
GOVERNANCE ACTIVITIES
= Help control data development and reduce risks associated with data use, while at the same time, enabling an organization to leverage data strategically. The purpose of data governance is to ensure that data is managed properly, according to policies and best practices
What do you need to define A DATA STRATEGY
- Setting data policies
- Data stewardship
- Data ownership
- Data valuation
- Data maturity assessment
- Data classification
- Instilling a cultural change
- Principles & ethics
Break down the data strategy development in 4 key stages
- Identify strategic business goals and align planned data initiatives with them
- Assess the current state and maturity of your data management environment
- Propose new capabilities, processes and technologies to meet business needs
- Plot out an implementation roadmap and an internal communication plan
what is data stewardship
Data stewardship refers to the management and oversight of an organization’s data. This includes ensuring the quality, accuracy, and security of the data, as well as ensuring that policies and procedures are in place to protect the data. Data stewards are responsible for overseeing the data and ensuring that it is being used appropriately, but they don’t own it.
data ownership
Data ownership refers to the individual or group within an organization that is responsible for the data and its use. Data owners are responsible for ensuring that the data is accurate, complete, and protected, and that it is being used in compliance with legal and regulatory requirements. They also have decision making power on how the data is used and shared.
classify data categories
Public:
Data that may be freely disclosed to the public
e.g. marketing materials, contact information, price lists
Internal Only:
Internal data not meant for public disclosure
e.g. battlecards, sales playbooks, organizational charts
Confidential:
Sensitive data that, if compromised, could negatively affect operations
e.g. contracts with vendors, employee reviews
Restricted:
Highly sensitive corporate data that, if compromised, could put the organization at financial or legal risk
e.g. IP, credit card information, social security numbers, PHI
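As a sketch (the names here are illustrative, not a standard API), the four levels can be modeled as an ordered enum so that access checks become a simple comparison of sensitivity:

```python
from enum import IntEnum

class Classification(IntEnum):
    """Ordered sensitivity levels: a higher value means more sensitive data."""
    PUBLIC = 1
    INTERNAL_ONLY = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4

def may_access(user_clearance: Classification, data_level: Classification) -> bool:
    """A user may read data classified at or below their clearance level."""
    return user_clearance >= data_level
```

Ordering the levels makes the policy explicit: someone cleared for Confidential can read Public and Internal Only data, but not Restricted data.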
what are lifecycle activities?
Lifecycle activities refer to the various stages that data goes through from its creation to its disposal. These stages include data collection, data processing, data storage, data analysis, data visualization, data archiving, and data deletion
what is plan & design, enable & maintain, use & enhance?
“Plan & design” involves determining the specific data requirements and goals for a project and creating a plan to achieve those goals, including data governance policies and technical infrastructure.
“Enable & maintain” ensures that the data is accurate, accessible, and protected, and manages the day-to-day operations of the data management system.
“Use & enhance” leverages the data for its intended purpose and continuously monitors and evaluates its effectiveness, identifying opportunities to improve or enhance data-driven processes.
what are foundational activities
Foundational activities refer to the basic tasks and processes that organizations must undertake in order to establish a solid foundation for data management. These activities are essential for ensuring the quality, accuracy, and security of the data and are typically the starting point for any data management initiative.
for a good foundation we need :
- Data quality
- Data protection & security
- Risk management
- Data privacy
explain data quality and GIGO
GIGO is an acronym for “garbage in, garbage out.” It is a principle that states that if the input data to a system is inaccurate or of poor quality, then the output from that system will also be inaccurate or of poor quality. In other words, if the data that is being used as an input is not accurate or reliable, the output will not be accurate or reliable either. This principle applies to a wide range of systems, including computer systems, data analysis, decision-making processes, and many others.
Data quality = the degree to which the data meets the expectations and needs of data consumers
Dimensions of data quality (data quality framework)
Is there enough data?
Data completeness: The proportion of data stored against the potential 100%
Is the data correct?
Data accuracy: The degree to which the data correctly describes the ‘real world’ object / event
Data validity: Data is valid if it conforms to the syntax (format, type, range) of its definition
How well does the data fit together?
Data consistency: The absence of differences
Data duplication / uniqueness: Data is not duplicated in unwanted ways within or across systems
Is the data up-to-date?
Data timeliness: This dimension refers to the relevance of the data in relation to the time it was collected or the time it is used. Data timeliness is important because it ensures that the data is relevant and useful for the intended purpose.
Impact of poor data quality linked to dimensions
Completeness: Poor data quality in terms of completeness can lead to missing or incomplete information, resulting in inaccurate or unreliable analysis and decision making.
Accuracy: Poor data quality in terms of accuracy can lead to errors and inconsistencies in the data, resulting in incorrect conclusions and decisions.
Timeliness: Poor data quality in terms of timeliness can lead to decisions being based on outdated information, resulting in missed opportunities or wasted resources.
Consistency: Poor data quality in terms of consistency can lead to confusion and misinterpretation of the data, resulting in inconsistent conclusions and decisions.
Validity: Poor data quality in terms of validity can lead to invalid conclusions and decisions, resulting in wasted resources and potential legal and regulatory issues.
Uniqueness: Poor data quality in terms of uniqueness can lead to data duplication, resulting in inconsistent and unreliable analysis, and also inefficiency in the data management process.
how can you manage data quality for data entry
Establish clear guidelines for data entry
- Use of capital letters, special symbols, numbers
- Define required fields
- Make sure the syntax is followed e.g. dates
- Automate data entry / calculations
- Give options to change data
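Parts of these guidelines can be automated at the point of entry. A minimal sketch (the required field names and the YYYY-MM-DD date format are assumptions for illustration):

```python
import re

# Hypothetical required fields and date syntax for a customer signup form
REQUIRED_FIELDS = {"name", "email", "signup_date"}
DATE_SYNTAX = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # enforce YYYY-MM-DD

def validate_entry(record: dict) -> list[str]:
    """Return a list of data-entry problems; an empty list means the record passes."""
    problems = []
    # Define required fields: flag any that are missing
    for field in REQUIRED_FIELDS - record.keys():
        problems.append(f"missing required field: {field}")
    # Make sure the syntax is followed, e.g. dates
    date = record.get("signup_date", "")
    if date and not DATE_SYNTAX.match(date):
        problems.append("signup_date does not follow YYYY-MM-DD syntax")
    return problems
```

Running such checks before a record is saved catches errors when they are cheapest to fix, instead of cleansing them downstream.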
how can you manage data quality with validation techniques
Data validation = the process of checking data for accuracy & completeness
- Data profiling: define what critical data must be complete & accurate
- Data cleansing: remove any errors and discrepancies
- Data matching and deduplication: compare data and look for similar data
Examples:
* Check if it is the correct data type / format
* Check if it is a value from a list of accepted values
* Check if a date is in a specified range
* Check for consistent expressions, e.g. begin & end date
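The example checks above can be sketched as small predicates plus a deduplication pass (the accepted status values are illustrative):

```python
import datetime

ACCEPTED_STATUSES = {"new", "active", "closed"}  # illustrative reference list

def check_type(value, expected_type) -> bool:
    """Correct data type / format check."""
    return isinstance(value, expected_type)

def check_in_list(value, accepted) -> bool:
    """Value must come from a list of accepted values."""
    return value in accepted

def check_date_range(d: datetime.date, lo: datetime.date, hi: datetime.date) -> bool:
    """Date must fall inside a specified range."""
    return lo <= d <= hi

def check_begin_before_end(begin: datetime.date, end: datetime.date) -> bool:
    """Consistency check between related fields, e.g. begin & end date."""
    return begin <= end

def deduplicate(records: list[dict], key: str) -> list[dict]:
    """Keep only the first record seen for each key value (deduplication)."""
    seen, unique = set(), []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            unique.append(r)
    return unique
```

In practice such rules come out of data profiling: you first decide which fields are critical, then attach checks like these to them.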
how can you avoid data silos to prevent unnecessary data duplication?
* Centralize data in e.g. a data warehouse or data lake
* Create a data transfer strategy + focus on people, processes & tools
* Develop a unified view of all your data (data dictionaries, data models, …)
* Focus on a cloud strategy
how can you keep data up to date?
- Centralize your data: with all data stored in one location, it is easier to manage and update.
- Train employees on how to properly enter, manage, and update data.
- Incentivize customers & employees to provide accurate and up-to-date data.
- Integrate open-source data: it can provide additional information and context that improve the accuracy and completeness of your data.
- Keep time stamps of data adjustments: this tracks data changes over time and makes it easier to identify any issues or errors in the data.
- Work with multiple opt-ins: this helps ensure your data is accurate, up-to-date, and reliable, and also gives different perspectives and insights on the data.
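The time-stamping point can be sketched as a small audit-logged update helper (field and record shapes are illustrative):

```python
import datetime

def update_field(record: dict, field: str, value, audit_log: list) -> None:
    """Apply an update and keep a timestamped audit entry, so data
    changes can be traced over time and errors identified later."""
    audit_log.append({
        "field": field,
        "old": record.get(field),
        "new": value,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    record[field] = value
```

Keeping old and new values alongside the timestamp means a bad update can be diagnosed and rolled back instead of silently overwriting history.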
what can effective data protection policies and procedures do and what is the goal of it?
= Effective data protection policies and procedures allow the right people to use and update data in the right way, and restrict all inappropriate access and updates.
Goals:
* Access control
* Compliance
* Ensuring that stakeholder requirements for privacy and confidentiality are met
what are the essential elements of a data security policy?
- Data privacy
- Password management
- Internet usage
- Email usage
- Company-owned devices
- Employee-owned mobile devices
- Social media
- Software copyright and licensing
- Security incident reporting
what is antivirus software?
= a computer program used to prevent, detect, and remove malware.
authentication
– is a process that ensures and confirms a user’s identity. 2-factor authentication is a process in which you ensure the user’s identity in 2 different ways before they get access.
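One common second factor is a time-based one-time password (TOTP), as used by authenticator apps. A minimal sketch of the standard HOTP/TOTP scheme (RFC 4226 / RFC 6238) using only the Python standard library:

```python
import hashlib
import hmac
import struct
import time

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """One-time code per RFC 4226: HMAC-SHA1 over a big-endian 8-byte
    counter, then 'dynamic truncation' down to the requested digits."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(secret: bytes, step: int = 30, digits: int = 6) -> str:
    """Time-based variant (RFC 6238): the counter is the current 30-second window."""
    return hotp(secret, int(time.time()) // step, digits)
```

Because the code depends on a shared secret and the current time window, a stolen password alone is not enough: the attacker would also need the device holding the secret.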
Backup?
To make a copy of data stored on a computer or server to reduce the potential impact of failure or loss.
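A minimal sketch of a file backup using only the standard library (the directory layout is illustrative):

```python
import pathlib
import shutil

def backup_file(src: str, backup_dir: str) -> pathlib.Path:
    """Copy a file into a backup directory; shutil.copy2 also preserves
    file metadata such as modification time."""
    dest = pathlib.Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the backup location if needed
    return pathlib.Path(shutil.copy2(src, dest))
```

Real backup strategies add scheduling, versioned or off-site copies, and periodic restore tests, but the core operation is this copy.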
Firewall?
A firewall is a tool that helps to protect a computer or network from unauthorized access by blocking certain incoming and outgoing connections. It acts as a barrier between a trusted internal network and untrusted external network, such as the internet. It can be thought of as a virtual gatekeeper that controls which data packets are allowed to enter or leave the network.
honeypot?
Honeypot – A decoy system or network that serves to attract potential attackers.
explain data lingo
Ransomware = disables victim’s access to data until ransom is paid
Fileless Malware = makes changes to files that are native to the OS
Spyware = collects user activity data without their knowledge
Adware = serves unwanted advertisements
Trojans = disguises itself as desirable code
Worms = spreads through a network by replicating itself
Rootkits = gives hackers remote control of a victim’s device
Keyloggers = monitors users’ keystrokes
Bots = launches a broad flood of attacks
Mobile Malware = infects mobile devices
Wiper Malware = A wiper is a type of malware with a single purpose: to erase user data beyond recoverability
Penetration testing = (also called pen testing) is the practice of testing a computer system, network or Web application to find vulnerabilities that an attacker could exploit.
how can malware enter your organisation?
Phishing emails
File attachments
USB sticks
Compromised websites
RDP (Remote desktop protocol)
Stolen credentials & compromised accounts
what is risk management
= the process of identifying, assessing and controlling threats to an organization’s capital and earnings
Prevention is better than cure => Risk scenario
A risk scenario is: a description of a possible event that, when occurring, will have an uncertain impact on the achievement of the enterprise’s objectives. The impact can be positive or negative.
what are GENERIC RISK SCENARIOS FOR INFORMATION
Backup media is lost or backups are not checked for effectiveness
Sensitive information is accidentally disclosed
Sensitive information is disclosed through e-mail or social media
Sensitive data is lost / disclosed through logical attacks
Data is modified intentionally
IP is lost and / or competitive information is leaked due to key team members leaving the enterprise
The enterprise has an overflow of data and cannot extract the business-relevant information from the data (e.g., big data problem).
which main categories of data are out there?
Reporting: is data organized for the purpose of reporting and business intelligence. Reporting data is created from transactional data, master data, and master reference data.
Transactional: describes business events. It is the largest volume of data in the enterprise
Master: is key business information that supports the transactions. The data on the products & customers supports the transaction
Reference: is a subset of master data that refers to the data that defines the set of permissible values to be used by other data fields
Metadata: is data that describes other data; it is the underlying definition or description of data.
what is master data
Master data is a type of data that is considered the “single source of truth” and is used to identify and describe core business entities. It is data that is used consistently across an organization and is considered to be critical to the organization’s operations. Examples of master data include:
Customer data: This includes information about customers such as their name, address, and contact information.
Product data: This includes information about products such as their name, description, and price.
Employee data: This includes information about employees such as their name, job title, and salary.
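As a sketch, a master record can be modeled as an immutable dataclass (the `Customer` fields here are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Customer:
    """A master-data record: the single authoritative description of a
    customer, referenced by sales, support and billing via customer_id
    rather than re-entered in each system."""
    customer_id: str
    name: str
    address: str
    email: str
```

Making the record frozen reflects the governance idea that master data is changed through a controlled process, not edited ad hoc by every consuming system.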
what is reference data ?
Reference data is a type of data that is used as a reference for other data. It typically includes a set of codes or values that are used to classify or categorize the data. Reference data is usually used in conjunction with transactional data, which is data that is related to a specific transaction or activity.
Examples of reference data include:
A list of valid product codes for a retail organization
A list of valid country codes for an international organization
A list of valid codes for classifying financial transactions
A list of valid codes for classifying medical diagnoses
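The permissible-values idea can be sketched in a few lines (the country codes and field name are illustrative):

```python
# Illustrative reference data: the set of permissible country codes
VALID_COUNTRY_CODES = {"BE", "NL", "FR", "DE", "US"}

def validate_country(transaction: dict) -> bool:
    """A transactional record is only valid if its country field holds
    one of the permissible values defined by the reference data."""
    return transaction.get("country") in VALID_COUNTRY_CODES
```

Keeping such lists as a single shared set, instead of hard-coding values in each application, is what makes reference data manageable.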
what is a data dictionary
= Reference guide on a dataset. The primary goal of a data dictionary is to help data teams understand & trust data assets. It describes the transactional, master & reference data.
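In its simplest form a data dictionary is a mapping from field name to definition; the entries below are illustrative:

```python
# A minimal data dictionary: each field mapped to its type, meaning and
# data category (all entries here are made up for illustration).
DATA_DICTIONARY = {
    "customer_id": {"type": "text", "category": "master",
                    "description": "Unique customer identifier"},
    "country":     {"type": "text", "category": "reference",
                    "description": "Permissible country code"},
    "order_total": {"type": "number", "category": "transactional",
                    "description": "Order amount in EUR"},
    "order_date":  {"type": "date/timestamp", "category": "transactional",
                    "description": "When the order was placed"},
}

def describe(field: str) -> str:
    """Render one dictionary entry as a readable reference line."""
    entry = DATA_DICTIONARY[field]
    return f"{field} ({entry['type']}, {entry['category']}): {entry['description']}"
```

Even this small structure answers the questions a data dictionary exists for: what a field means, what type it is, and which data category it belongs to.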
data types
Text
Number
Date/timestamp
what is data storage
= refers to magnetic, optical or solid state media that records and preserves digital information for ongoing or future operations.
who manages data storage
DBA = Database administrator
* Defining storage requirements
* Defining access requirements
* Developing database instances
* Managing the physical storage environment
* Loading data
* Replicating data
* Tracking usage patterns
* Managing backup and recovery
* Database performance and availability
* Data migration
* Enabling data audits and validation
what is data integration & interoperability?
Data integration refers to the process of combining data from multiple sources into a single, unified view. This can be accomplished through a variety of techniques such as data warehousing, ETL (extract, transform, load) processes, and data federation. The goal of data integration is to make it easier for users to access and analyze the data they need, regardless of where it is stored.
Interoperability, on the other hand, refers to the ability of different systems and applications to work together seamlessly. In the context of data, interoperability means that different systems are able to exchange and make use of data in a consistent and meaningful way. This can be achieved through the use of common data formats, protocols, and standards.
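As a toy sketch of an ETL pass (the source shapes and field names are assumptions), two sources are extracted, transformed to a common shape, and loaded into one unified view keyed on a normalised email:

```python
def extract_transform_load(crm_rows: list[dict], billing_rows: list[dict]) -> dict:
    """Combine a CRM source and a billing source into a single unified view."""
    unified = {}
    for row in crm_rows:                  # extract: source 1
        key = row["email"].lower()        # transform: normalise the join key
        unified[key] = {"email": key, "name": row["name"]}
    for row in billing_rows:              # extract: source 2
        key = row["email"].lower()
        unified.setdefault(key, {"email": key, "name": None})
        unified[key]["balance"] = row["balance"]  # load: merge into the view
    return unified
```

Normalising the key during the transform step is what lets the two systems interoperate here: both agree on one consistent identifier format before the data is merged.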
master & reference data management
= Ensure the uniformity, accuracy, stewardship and semantic consistency of the organization's shared master and reference data