Data Analysis Concepts Term Glossary Flashcards
Describe ‘Structured data’
Data that is coded in a manner that makes it easily converted into a form usable for data analysis
Describe ‘Semi-Structured’ data
Data that follows a single formatting scheme which also describes the data (like XML). It can be parsed, but the data may need to be wrangled (re-formatted)
Describe ‘quasi-structured data’
A little structure but may include multiple formats. Can be formatted with considerable effort
Describe ‘unstructured data’
Data that is more complex (contains various formats and data types) and possibly stored in a format that is not easily decoded.
State the three categories that ‘Qualitative’ data can be
Binomial/Binary - Two exclusive groups, e.g. Yes/No, Pass/Fail, True/False
Nominal - Multiple groups with no distinct ordering (such as regions, hair colour, blood groups)
Ordinal - Similar to nominal but with an intrinsic order (such as satisfaction levels, salary bands)
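The difference between nominal and ordinal data shows up when sorting. The sketch below is illustrative only: the band names and responses are invented, and BAND_ORDER is a hypothetical helper that spells out the intrinsic order.

```python
# Hypothetical ordinal scale: the order is part of the data's meaning.
BAND_ORDER = ["Low", "Medium", "High"]

# Made-up survey responses.
responses = ["High", "Low", "Medium", "Low"]

# Sorting as plain text (nominal treatment) gives alphabetical order.
alphabetical = sorted(responses)

# Sorting by position in the scale (ordinal treatment) respects the ranking.
ranked = sorted(responses, key=BAND_ORDER.index)

print(alphabetical)
print(ranked)
```

Alphabetical sorting places "High" first, which is why ordinal categories usually need their order stated explicitly.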
State the two categories that ‘Quantitative’ data can be
Continuous - Decimal numbers that can be measured to any level of precision, e.g. heights, speeds, distances, time
Discrete - Normally whole numbers such as counts, ranks, indexes
Define ‘Open’ data
Available in a machine-readable format without restrictions over the ability to use, consume, or share the information.
Define ‘public’ data
Available to the public to collect or look at, but it’s not easily redistributed (or machine readable) and sometimes not easily obtained.
Define ‘proprietary’ data
Data whose ownership is claimed by a specific entity or company. It may be protected under copyright, patent, or trade secret laws.
Define ‘Operational’ data
Data used in day-to-day business operations. Examples: information on direct competitors, information on suppliers, accounting data and projections of needed resources.
Define ‘Administrative’ data
Data collected to produce management information. Used to guide future actions but not strictly necessary for the immediate operation of a business.
What does ‘RTF’ stand for, when referring to file types
Rich Text Format
What does ‘XML’ stand for, when referring to file types
Extensible Markup Language
What does ‘JSON’ stand for, when referring to file types
JavaScript Object Notation
Describe the structure of a JSON file
Uses key:value pairs that are stored in lists (arrays), records (objects), and sub-records (nested objects).
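A minimal sketch of that structure, parsed with Python's standard json module. The record itself is invented for illustration.

```python
import json

# Hypothetical JSON document: key:value pairs, a list, and a sub-record.
text = """
{
  "name": "Ada",
  "active": true,
  "scores": [72, 85, 91],
  "address": {"city": "Leeds", "postcode": "LS1"}
}
"""

record = json.loads(text)            # JSON object becomes a Python dict
city = record["address"]["city"]     # value inside a sub-record
first_score = record["scores"][0]    # first item of a list

print(city, first_score)
```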
State and describe the three ‘V’s when referring to Big Data
High Volume - can be Peta/Exabytes and distributed across multiple computers/servers
Large Variety - Contains many different data types
Fast Velocity - Data ingestion and data creation is rapid
State the five stages of the data life cycle in order
1) Creation
2) Initial Storage
3) Archiving
4) Obsolete
5) Deleted
Describe the ‘Creation’ stage in the data life cycle
The data is created, either by measurement or collection.
Describe the ‘Initial storage’ stage in the data life cycle
The data is organised and stored to make analysis easier
Describe the ‘Archiving’ stage in the data life cycle
The data is archived with summary data to help future research
Describe the ‘Obsolete’ stage in the data life cycle
The data becomes obsolete, often because systems or processes have been updated or because the data itself has been superseded
Describe the ‘Deleted’ stage in the data life cycle
All copies of the data are removed
What type of graph am I describing?
“Non-Technical, good for showing non-consecutive data points and excellent for showing relationships and trends between variables.”
Scatter chart
What type of graph am I describing?
“Technical, good for showing more details and features beyond what a scatter chart can achieve.”
Bubble chart
What type of graph am I describing?
“Non-technical, shows consecutive changes in a variable as well as the starting and ending value.”
Waterfall chart
What type of graph am I describing?
“Technical, best for showing scores or ranks for multiple features/axes.”
Radar/Spider chart
What type of graph am I describing?
“Technical, very good for showing lots of data as a surface. Shows X (left to right), Y (up and down) and Z values (colour of pixel).”
Heat map
What type of graph am I describing?
“Technical, best for showing project progression and planning different stages of a process.”
Gantt Chart
What type of graph am I describing?
“Non-technical, good for showing opinions and sentiments (positive and negative feedback). An excellent graph type for showing qualitative information in reports.”
Word cloud
State and describe the three types of time-series
Time-series - A variable that changes over time. Often measured at regular intervals and can show trends, seasonality and random fluctuations
Stationary time-series - A time series that fluctuates but stays around a set mean value, e.g. daily temperature, heart rates, blood pressure, call frequencies
Non-stationary time-series - A time series that contains a trend and moves away from a mean value, e.g. stock prices over 30 years, air passenger numbers over 40 years.
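One rough way to see the difference between the last two types is to compare the mean of the first half of a series with the mean of the second half. This is only an informal check on made-up data, not a formal stationarity test, and half_mean_drift is a hypothetical helper name.

```python
from statistics import mean


def half_mean_drift(series):
    """Absolute difference between the means of the two halves of a series."""
    mid = len(series) // 2
    return abs(mean(series[mid:]) - mean(series[:mid]))


# Made-up stationary series: wobbles by +/-1 around a fixed mean of 20.
stationary = [20 + (-1) ** i for i in range(100)]

# Made-up non-stationary series: the same wobble plus a steady upward trend.
trending = [20 + 0.5 * i + (-1) ** i for i in range(100)]

print(half_mean_drift(stationary))
print(half_mean_drift(trending))
```

The stationary series shows no drift between halves, while the trending series drifts by a large amount.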
State and describe the two ‘types’ of errors
Type 1 - False positive. This occurs when you have a positive result that turns out to be wrong, e.g. a fire alarm goes off but it was only triggered by air pollution
Type 2 - False negative. This occurs when you have a negative result that turns out to be wrong, e.g. there is a fire but the fire alarm doesn’t go off
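The fire alarm example can be counted in code. The event log below is invented purely for illustration.

```python
# Each pair is (alarm_sounded, fire_present); the values are made up.
events = [
    (True, True),    # correct alarm
    (True, False),   # alarm without a fire
    (False, True),   # fire without an alarm
    (False, False),  # correctly silent
    (True, False),   # alarm without a fire
]

# Type 1: the alarm says "fire" but there is none (false positive).
type_1 = sum(1 for alarm, fire in events if alarm and not fire)

# Type 2: the alarm stays silent during a real fire (false negative).
type_2 = sum(1 for alarm, fire in events if not alarm and fire)

print(type_1, type_2)
```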
State the two meanings for data aggregation
1) On a table, this refers to when you calculate summary statistics from a larger set of numbers
2) The process of accumulating data from different sources.
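Both meanings can be shown in a few lines. The sales figures and source names below are assumptions made up for this sketch.

```python
from statistics import mean

# Meaning 2: accumulate data from different (hypothetical) sources.
north_sales = [120, 95, 130]
south_sales = [80, 110]
combined = north_sales + south_sales

# Meaning 1: calculate summary statistics from the larger set of numbers.
summary = {
    "count": len(combined),
    "total": sum(combined),
    "mean": mean(combined),
}

print(summary)
```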
Define ‘data blending’
The process of combining multiple tables together into a single larger file
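A minimal sketch of blending two small tables on a shared key, using plain Python dictionaries. The table names, fields and values are all hypothetical.

```python
# Two made-up tables that share a customer identifier.
customers = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]
orders = [
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 2, "total": 12.5},
    {"customer_id": 1, "total": 8.0},
]

# Build a lookup from the first table, then attach it to each order row.
names_by_id = {c["id"]: c["name"] for c in customers}
blended = [
    {"name": names_by_id[o["customer_id"]], "total": o["total"]}
    for o in orders
]

for row in blended:
    print(row)
```

In practice this is often done with a database join or a dataframe merge; the loop above only shows the idea.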
Define ‘data masking’
Anonymising or redacting data before analysis begins
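A small sketch of masking a record before analysis. It assumes a salted hash is an acceptable stand-in for a name (strictly this is pseudonymisation rather than full anonymisation), and mask_record, the salt and the record are all hypothetical.

```python
import hashlib


def mask_record(record, salt="example-salt"):
    """Return a copy of the record with identifying fields masked."""
    # Replace the name with a short token derived from a salted hash.
    digest = hashlib.sha256((salt + record["name"]).encode("utf-8")).hexdigest()
    return {
        "person": digest[:12],
        # Redact everything after the first three postcode characters.
        "postcode": record["postcode"][:3] + "***",
        # Non-identifying values are kept for analysis.
        "score": record["score"],
    }


record = {"name": "Ada Lovelace", "postcode": "LS1 4AB", "score": 91}
masked = mask_record(record)
print(masked)
```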
State and describe the three stages of ETL
1) Extract - extract the data from a larger source. This data may be unprocessed raw data and it may also be unstructured
2) Transform - Change the format and structure of the data so that it can be stored in a manageable database.
3) Load - load the data to a given destination
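The three stages can be sketched with Python's standard library. The CSV text, the people table and its columns are assumptions invented for this example.

```python
import csv
import io
import sqlite3

# Hypothetical raw source: a small CSV with heights in centimetres.
raw = "name,height_cm\nAda,170\nGrace,165\n"

# 1) Extract: read the rows out of the source.
rows = list(csv.DictReader(io.StringIO(raw)))

# 2) Transform: tidy the names and convert centimetres to metres.
cleaned = [(r["name"].strip(), int(r["height_cm"]) / 100) for r in rows]

# 3) Load: write the transformed rows to the destination database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, height_m REAL)")
conn.executemany("INSERT INTO people VALUES (?, ?)", cleaned)
conn.commit()

loaded = conn.execute("SELECT name, height_m FROM people ORDER BY name").fetchall()
print(loaded)
```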
Describe ‘sampling error’
When a sample differs from the population simply because of random chance, for example when more of one kind of measurement is made than another. Taking more samples reduces it.
Describe ‘non-sampling error’
Occurs when you introduce a consistent error in your measurements. This error cannot be reduced by taking more samples of the same type.
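A simulation can show the contrast between the two kinds of error. This sketch assumes normally distributed measurements and a hypothetical instrument that always reads 2 units too high; all numbers are invented.

```python
import random

random.seed(0)

TRUE_MEAN = 50.0
BIAS = 2.0  # hypothetical miscalibration: every reading is 2 units too high


def sample_mean(n, bias=0.0):
    """Mean of n simulated measurements, optionally with a consistent bias."""
    total = sum(random.gauss(TRUE_MEAN, 10.0) + bias for _ in range(n))
    return total / n


# Sampling error: the gap from the true mean is due to chance alone.
small_error = abs(sample_mean(10) - TRUE_MEAN)
large_error = abs(sample_mean(100_000) - TRUE_MEAN)

# Non-sampling error: a consistent bias remains however many samples are taken.
biased_error = abs(sample_mean(100_000, bias=BIAS) - TRUE_MEAN)

print(small_error, large_error, biased_error)
```

With a very large sample the unbiased error shrinks towards zero, while the biased error stays close to the size of the bias.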
Describe ‘data reconciliation’
When you combine more samples together in an effort to reduce sampling error
Name the two types that personal information can be split into
Personal - Any record personal to yourself (e.g. therapist records)
Identifying - Information that can be used to identify an individual from others in a dataset
Name the three basic principles in the Data Protection Act (1998)
Incorrect information must be corrected
Data must not be used to cause harm or distress
Data must not be used for direct marketing
Name the eight principles that personal data must meet under the Data Protection Act (DPA)
1) Fairly and lawfully processed
2) Processed for limited purposes
3) Adequate, relevant and not excessive
4) Accurate
5) Not kept for longer than is necessary
6) Processed in line with your rights
7) Personal data must be kept securely
8) Not transferred to other countries without adequate protection
Name the eight main individual rights under the General Data Protection Regulation (GDPR)
1) The right to be informed
2) The right to access
3) The right to rectification
4) The right to erasure
5) The right to restrict processing
6) The right to data portability
7) The right to object
8) Rights related to automated decision making and profiling
State and describe the six common general data structures on a computer
1) Directories
2) Lists
3) Arrays
4) Records
5) Trees
6) Tables
Describe ‘lists’, when referring to general data structures
Contains different data types in one structure.
Less structured and slower to query
Describe ‘Arrays’, when referring to general data structures
Sequence of items that must all be of the same data type
Can be multi-dimensional
Good for mathematical operations and fast querying
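The contrast between lists and arrays can be seen with Python's built-in list and the standard array module. The values are made up, and the array here is one-dimensional only.

```python
from array import array

# A list can hold different data types in one structure.
mixed = [1, "two", 3.0]

# An array is declared with one type code ("d" means double-precision float).
heights = array("d", [1.62, 1.75, 1.80])

# Appending a value of the wrong type is rejected.
try:
    heights.append("tall")
    rejected = False
except TypeError:
    rejected = True

print(mixed, list(heights), rejected)
```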
Describe ‘Records’, when referring to general data structures
Two types:
1) In a database, a record is a row in a table
2) In an object-based record, this is an object that contains numerous fields/attributes. These records can contain data, other records or functions
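The second type, an object-based record, can be sketched with Python dataclasses. Customer, Address and label are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Address:
    city: str
    postcode: str


@dataclass
class Customer:
    name: str          # a plain data field
    address: Address   # a field that holds another record

    def label(self) -> str:
        """A function stored on the record alongside its data."""
        return f"{self.name} ({self.address.city})"


customer = Customer("Ada", Address("Leeds", "LS1"))
print(customer.label())
```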
Describe ‘Trees’, when referring to general data structures
A linked structure made up of nodes.
Each node can have daughter (child) nodes
Normally used to show hierarchical data
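A minimal sketch of a tree holding hierarchical data. The Node class, the walk function and the company structure are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)  # daughter nodes


def walk(node, depth=0):
    """Yield (depth, name) for a node and then for all nodes beneath it."""
    yield depth, node.name
    for child in node.children:
        yield from walk(child, depth + 1)


# Hypothetical hierarchy.
root = Node("Company", [
    Node("Sales", [Node("UK"), Node("EU")]),
    Node("Engineering"),
])

outline = list(walk(root))
for depth, name in outline:
    print("  " * depth + name)
```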
Describe ‘Tables’, when referring to general data structures
Common store of data - contains rows (records) and columns (fields)
Each field has a single data type.
All records have the same number of fields, and all fields have the same number of entries
State the three types of data models
1) Conceptual data model
2) Logical data model
3) Physical data model
Describe a ‘Conceptual’ data model
A basic plan of a database
Contains the tables and links (often with no cardinality)
Used to plan a database or to communicate its basic scheme to non-technical people
Describe a ‘Logical’ data model
Similar to a conceptual data model, but with more detail about the format, primary keys and foreign keys
Highlights database names and relationships with other databases
Describe a ‘Physical’ data model
Includes more detail than ‘Logical’ (such as data types, database triggers, stored procedures, access constraints/permissions)
Uses specific names for tables and fields
Contains all the information needed to start implementing a database
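As a rough illustration of the level of detail in a physical data model, the sketch below creates two tables in an in-memory SQLite database with specific names, data types, keys and constraints. The schema is hypothetical and is not taken from any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce foreign keys

# Hypothetical schema with specific table names, field names and data types.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total       REAL NOT NULL CHECK (total >= 0)
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO customer_order VALUES (10, 1, 30.0)")

# An order that points at a customer that does not exist should be refused.
try:
    conn.execute("INSERT INTO customer_order VALUES (11, 99, 5.0)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

print(fk_enforced)
```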