All Flashcards
Define Statistics
The art, language and science of data.
What is synonymous with Domain Knowledge
Business/context understanding.
Define Data
The raw, unorganised facts used in analysis.
Define Information
Data which has been processed to make it useful.
Define Knowledge
Understanding of the information.
List three common data formats
CSV
XML
RTF
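As a quick illustration, a CSV document can be parsed into usable rows with Python's built-in csv module (the field names and values here are invented):

```python
import csv
import io

# A two-line CSV document: a header row, then one data row.
raw = "name,age\nAda,36\n"

# DictReader pairs each value with its column header.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"], rows[0]["age"])  # Ada 36
```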
Define Open Data
Data which may have no copyright or referencing requirement, e.g. open-source software such as R.
Define Public Data
Data within the public domain. Free to use, but still has ownership and restrictions.
Define Proprietary Data
Opposite of public data. Private IP of a company.
Define Operational Data
Used in the day-to-day activities of a business, e.g. customer records.
Define Administrative Data
Data used to make informed decisions, often the subject of analysis.
Define Structured and Unstructured Data
Structured data has a well defined model. It’s easy to tabularise.
Unstructured data has no defined model.
Types of Quantitative Data
Discrete/categorical data are numeric variables which can only take specific values; values between them cannot occur.
Continuous data can take any value within an interval.
Types of Qualitative Data
Nominal is label data with no order.
Ordinal is label data which can be ordered.
Binomial is a binary data label, e.g. TRUE/FALSE.
What are the stages of the Data Lifecycle?
Created
Initial storage
Archived
Obsolete
Deleted
How do Databases and Structured Data relate?
A database is a repository of structured data.
What is a Relational Database?
A large grouping of schemes, tables, queries, reports, views and other elements.
Explain Tables in the relational model
In the relational model, every relation must have a header (columns) and a body (rows).
Define Keys
Designated columns within a table with which the data can be ordered and linked.
What are some examples of Semi-Structured data?
XML and CSV are technically semi-structured, as some processing is required to get them into table form.
Define Big Data
Sets of data which are beyond the capabilities of traditional data processing software. They must be analysed computationally.
What are the four Vs of Big Data?
Volume
Variety
Velocity
Veracity
What are Requirements?
The constraints placed on an analysis project, usually determining the data to analyse. Aims to establish the purpose of the project.
What is Explicit Knowledge?
Knowledge that can easily and swiftly be articulated to other people and is usually stored somewhere.
What is Tacit Knowledge?
Knowledge that cannot be readily articulated to other people, may be assumed and may not be stored.
What is Elicitation?
A proactive activity, where the analyst initiates conversations with stakeholders to gain an understanding of the problem.
What are some techniques of Requirement Elicitation?
Interviewing
Observing
Recounting
Apprenticing
What is Recounting?
The method of having multiple stakeholders articulate their requirements. Aims to identify misunderstandings, assumptions and reach consensus.
What is the difference between Requirements Elicitation and Gathering?
Requirements gathering is a reactive activity - data exists and must be collected and analysed.
Elicitation is a proactive activity. The analyst initiates conversations with stakeholders to gain an understanding of their problem.
What are some Elicitation challenges?
Problems of scope - customers give ill-defined or unnecessary requirements.
Problems of volatility - requirements change over time.
Problems of understanding - customers unsure of what is needed and the capabilities in their computing environment.
What are some Elicitation solutions?
Visualisation
Consistent language
Guidelines
Consistent use of templates
Documenting dependencies
What are the Elicitation guidelines?
Assess business + technical feasibility.
Identify requirement specifiers and their bias.
Define technical environment.
Identify domain constraints.
Select 1+ Elicitation techniques.
Encourage participation from many stakeholders.
Identify ambiguous requirements for prototyping.
Use usage scenarios to help customers better identify their key requirements.
What is the difference between Validation and Verification?
Validation judges the accuracy of something, eg 50% of company records are compliant.
Verification is concerned with meeting standards in absolute terms, eg the company records are not compliant.
Define the types of Data Models
Conceptual - high-level mappings of database elements and the relationships between them. Identifies info to collect, attributes and class relationships.
Logical - converts business requirements into a model. Revolves around customer need, rather than technical needs. eg a flow diagram.
Physical - a full server model diagram, showing the detail of the database. Shows constraints, eg keys and check constraints.
Define Check Constraints
Check whether an attribute meets a certain requirement.
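A minimal sketch using Python's built-in sqlite3 (the table and column names are invented): the CHECK clause rejects any row failing the requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint rejects any row whose age is negative.
conn.execute("CREATE TABLE person (name TEXT, age INTEGER CHECK (age >= 0))")
conn.execute("INSERT INTO person VALUES ('Ada', 36)")      # passes the check
try:
    conn.execute("INSERT INTO person VALUES ('Bob', -1)")  # fails the check
except sqlite3.IntegrityError:
    print("check constraint rejected the row")
```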
Define Quality
The standard of something when compared to other things of a similar kind.
For data, quality doesn’t need to be perfect - just high enough for the specific analysis.
What are the 8 principles of the Data Protection Act?
Used fairly and lawfully.
Used for limited, specifically stated purposes.
Used in a way that’s adequate, relevant and not excessive.
Accurate.
Kept for no longer than absolutely necessary.
Handled according to people’s data protection rights.
Kept safe and secure.
Not to be transferred outside the EEA.
Under the Data Protection Act, for what do stronger legal protections exist?
Race, ethnic background, political opinions, religious beliefs, trade union membership, genetics, biometrics, health, sexual orientation.
What are the 8 rights under GDPR?
Right to be informed.
Right of access.
Right to rectification.
Right to erasure.
Right to restrict processing.
Right to data portability.
Right to object.
Rights in relation to automated decision making and profiling.
Which acronym gives the fundamentals of Data Security?
CIA
Confidentiality
Integrity
Availability
What are the reasons for Dirty Data?
Data is missing.
Data is incorrect.
Incorrectly formatted.
Entered into wrong fields.
Stale (out of date).
Missing links, e.g. relationships.
Duplicated.
What are the sources of Data Error?
Completeness - does not capture the entire problem.
Uniqueness - no duplicates.
Timeliness - data is available when expected and needed.
Accuracy - data reflects reality.
Consistency - providing the same data for the same data object.
Conformity - the data follows the required format.
How can Data Error be avoided?
Process - Put greater controls around data creation.
Entry - use controls such as drop-down lists, with independent checking, to ensure correct data entry.
Identification - searching for errors in data.
Validate - automatically or manually check accuracy in data.
What are the steps in the Data Analysis Process?
Problem hypothesis
Identify what to measure
Collect data
Cleanse data
Model data
Visualise data
Analyse data
Interpret results
Document/communicate results
Define a Hypothesis
A possible explanation for something, which serves as a starting point for further investigation.
What’s the difference between H0 and H1?
The null hypothesis (H0) is the default assumption: that nothing has changed.
The alternative hypothesis (H1) is the prediction you make; it can be accepted if H0 is rejected.
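A worked sketch, assuming a fair-coin H0: the exact p-value for observing 60 or more heads in 100 flips, using only Python's standard library.

```python
from math import comb

n, observed = 100, 60
# P(X >= 60) under H0: a fair coin, so X ~ Binomial(100, 0.5).
p_value = sum(comb(n, k) for k in range(observed, n + 1)) / 2**n
# p_value is roughly 0.028: small enough to question H0 at the 5% level.
print(round(p_value, 4))
```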
Define Data Accessibility
Data in a format that is easy to handle/manage. Similar to data quality.
Define Data Extraction
Adding further structure to data. Yields usable data from unstructured data.
What are the types of Data Cleansing?
Filtering - data is included based on a Boolean condition.
Interpolation - using other data points to fill in the gaps.
Masking - hides certain data from view by unauthorised people, but still allows analysis to occur.
Blending - Combining data from different sources into a single dataset. May be warehoused.
Transformation - changing data from one format/structure to another.
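Two of these steps, filtering and interpolation, can be sketched in plain Python (the data values are invented):

```python
# Filtering: keep only readings that satisfy a Boolean condition.
readings = [12.0, None, 15.0, -999.0, 18.0]
valid = [r for r in readings if r is not None and r >= 0]
print(valid)  # [12.0, 15.0, 18.0]

# Interpolation: fill a gap using its neighbouring data points.
series = [10.0, None, 14.0]
series[1] = (series[0] + series[2]) / 2
print(series)  # [10.0, 12.0, 14.0]
```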
What is an ETL process?
Extract Transform Load
Define the source. Define the target. Define the mapping. Create the session. Create the workflow.
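The three stages can be sketched as a toy pipeline in Python (the source data and field names are invented):

```python
import csv
import io
import sqlite3

# Extract: read rows from the source (a CSV document here).
source = io.StringIO("name,score\nada,90\nbob,75\n")
rows = list(csv.DictReader(source))

# Transform: apply the mapping (capitalise names, cast scores to int).
rows = [(r["name"].title(), int(r["score"])) for r in rows]

# Load: write the transformed rows into the target database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (name TEXT, score INTEGER)")
db.executemany("INSERT INTO results VALUES (?, ?)", rows)
print(db.execute("SELECT * FROM results").fetchall())
```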
Define Data Models
Mathematical abstractions of reality. Seek to capture relationships between variables.
Data = Model + Error
Explain Inferential Statistics
A branch of statistics which quantifies relationships in data (in contrast to descriptive statistics, which summarise it).
Correlation quantifies strength of linear trend.
Hypothesis testing assesses the significance of patterns in data.
Regression analysis models trends.
What are some types of Data Visualisation?
Infographics
Time series
Part to whole
Geospatial
Define Data Analysis
Deriving insight and meaning from data. Includes assessing trends and correlations.
Define a Variable (data structure)
A reference to a particular location in a computer’s memory (an address).
Define an Array (data structure)
A sequence of slots of memory, where each slot contains an element (value or object). Deleting and inserting can be slow - they change the addresses of the elements after the affected position.
Define a List (data structure)
Similar to arrays, but permit elements of more than one data type. Values can be inserted/deleted without changing the address of other elements.
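The insertion property can be seen with a minimal singly linked list in Python (the Node class is illustrative): only one link changes, and no other element moves.

```python
class Node:
    """One element of a singly linked list."""
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

# Build the list 1 -> 3, then insert 2 between them.
head = Node(1, Node(3))
head.next = Node(2, head.next)  # one link changes; other elements keep their addresses

values = []
node = head
while node:
    values.append(node.value)
    node = node.next
print(values)  # [1, 2, 3]
```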
Define a Class (data structure)
A data structure containing data fields. It offers a blueprint defining the variables common to an object.
Define a Tree (data structure)
Shows a hierarchical data structure. The top node is called the root. Faster than arrays when inserting and deleting, but slower than linked lists.
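A minimal sketch of a tree and a depth-first walk in Python (the class and node names are illustrative):

```python
class TreeNode:
    """A node in a simple tree: a value plus any number of children."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

# A small hierarchy: the top node is the root.
root = TreeNode("root", [TreeNode("a", [TreeNode("a1")]), TreeNode("b")])

def walk(node):
    """Depth-first traversal, yielding each node's value."""
    yield node.value
    for child in node.children:
        yield from walk(child)

print(list(walk(root)))  # ['root', 'a', 'a1', 'b']
```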
Define a Record (data structure)
A value that contains other values. Also known as a tuple or struct. (A row in a table.)
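In Python, a record can be sketched with a namedtuple (the Customer type and its fields are invented): one named value containing other values, like a table row.

```python
from collections import namedtuple

# A record type: one row in a hypothetical customers table.
Customer = namedtuple("Customer", ["id", "name", "city"])

row = Customer(1, "Ada", "London")
print(row.name, row.city)  # Ada London
```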
Define a Schema
A database design including conceptual, logical and physical considerations.
What are the types of Schema?
Conceptual schema - a representation of an organisation, showing the entities, attributes and relationships.
Logical schema - the natural successor, articulates data structures, eg tables, objects and shows relationships.
Physical schema - successor to the logical schema, includes precise detail on the database structure.
What is a Relational Database?
It breaks data into multiple tables. Tables linked through primary and foreign keys.
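A minimal sketch with Python's built-in sqlite3 (table and column names are invented): two tables linked through a primary/foreign key pair, then joined.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- foreign key
        item TEXT
    );
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 'laptop');
""")

# Join the two tables through the key link.
row = db.execute(
    "SELECT c.name, o.item FROM customers c "
    "JOIN orders o ON o.customer_id = c.id"
).fetchone()
print(row)  # ('Ada', 'laptop')
```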
What is a Flat File Database?
Before relational, all data was stored in a single table (eg a spreadsheet).
What is a Hierarchical Database?
Organised into a tree structure. Parent records can have many child records. Each child record can have one parent record. Still widely used for certain functions.
What is a Network Database?
Aims to boost the flexibility of hierarchical databases by allowing many-many relationships between records.
Still less flexible than relational.
What is an Object-Oriented Database?
Info on each entity is stored within a single object, e.g. each customer has an object storing their own file info.
What is a Multi-dimensional Database?
Data visualised as a collection of cubes. Includes data cubes and hyper cubes (more than three dimensions).
What is a NoSQL Database?
Not only SQL database. Came from the need to have large scale, clustered databases. Useful for unstructured data.
What are the types of NOSQL Database?
Document store - stores semi-structured data, allowing devs to update code without referring to a central schema, e.g. JSON and XML documents.
Wide-column store - organised data into columns rather than rows. Each column has lots of info on the same entity. Can be faster to query large volumes.
Graph store - data stored in nodes, rather than traditional records. Node connections known as edges.
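A document store's schema-free updates can be sketched with Python's json module (the document content is invented): a field is added without any central schema change.

```python
import json

# A semi-structured document, as a document store might hold it.
doc = json.loads('{"id": 1, "name": "Ada"}')

# Adding a field needs no central schema change.
doc["city"] = "London"
print(json.dumps(doc, sort_keys=True))
```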
Define Normalisation
The process of organising tables (and their columns) in order to improve data integrity.
What are the two types of Anomalies?
Insertion anomalies - describes when data cannot be added into the table.
Deletion anomalies - describes attributes being lost when other attributes are deleted.
What are the three Normal Forms?
First normal form - data is stored in a relational table with no multi-valued columns.
Second normal form - all columns depend on the table's primary key.
Third normal form - no column has a transitive dependency on the primary key.
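First normal form can be sketched in plain Python: a multi-valued column is split so each row holds a single value (the field names and data are invented).

```python
# Unnormalised: the 'phones' column is multi-valued.
unnormalised = [{"id": 1, "name": "Ada", "phones": "555-1;555-2"}]

# First normal form: one phone number per row.
normalised = [
    {"id": r["id"], "name": r["name"], "phone": p}
    for r in unnormalised
    for p in r["phones"].split(";")
]
print(normalised)
```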
Define Data Warehousing
Data stored ready to be dispatched/used.
Explain four Database Maintenance techniques
Log file maintenance - log files contain a history of every transaction against the database.
Log files are a form of redundancy (they’re data additional to the actual data).
Data compaction - frees up unused space for new data, but doesn’t necessarily reduce the size of the database file. May require downtime.
Defragmentation - identifies data that is related and relocates it to the same physical location to improve performance.
Integrity checks - looks for problems with data that may cause corruption or other problems. Eg a virus scan.
What is a Canonical data model?
Provides a high-level view of entities and their relationships across an organisation.
Define Data Architecture
The set of rules, policies, standards or models set by the organisation that govern the use of its data.
It is primarily a business process rather than a technical one.
Define Data Policies, Standards and Rules
Data policies - a broad framework for how decisions should be made regarding data.
Data standards - provide detailed rules on how to implement data policies.
Data rules - provide specific instructions on how to implement data standards.
Define Data Migration
The transfer of data from one storage/computing environment to another.
Define Data Integration
Combining data from different sources to provide a unified view.
What are the four features of Database Architecture?
Database design, data warehousing, migration and integration.
Define Domain Context
Understanding of the business environment the data is in.
Define Decision Analytics
Using visual data techniques to support choices or decisions made by people.
Define Descriptive Analytics
Focuses entirely on the understanding of historical data. Can inform decision-making.
Define Predictive Analytics
Using historical data to understand or predict the future and inform decisions.
Define Prescriptive Analytics
The integration of predictive analytics into business systems. Seeks to identify what will happen, when and why.
What is a Functional Requirement?
It describes a feature which the solution should have.
What are the steps of the ETL process?
Define the source
Define the target
Create the mapping
Create the session
Create the workflow
What is data validation?
The process of ensuring a program operates on clean, correct and useful data.
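A minimal validation sketch in Python (the rules and field names are invented): each record is checked before the program operates on it.

```python
def validate_record(record):
    """Return a list of problems found in one data record (rules are illustrative)."""
    problems = []
    if not record.get("name"):
        problems.append("missing name")
    if not isinstance(record.get("age"), int) or record["age"] < 0:
        problems.append("age must be a non-negative integer")
    return problems

print(validate_record({"name": "Ada", "age": 36}))  # [] - clean record
print(validate_record({"name": "", "age": -1}))     # two problems reported
```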