Data Analysis Concepts Term Glossary Flashcards

1
Q

Describe ‘Structured data’

A

Data that is coded in a manner that makes it easily converted into a form usable for data analysis

2
Q

Describe ‘Semi-Structured’ data

A

Data with a single formatting scheme that describes the data itself (such as XML). It can be parsed, but the data may need to be wrangled (re-formatted).

3
Q

Describe ‘quasi-structured data’

A

Data with a little structure that may include multiple formats. It can be formatted, but only with considerable effort.

4
Q

Describe ‘unstructured data’

A

Data that is more complex (contains various formats and data types) and possibly stored in a format that is not easily decoded.

5
Q

State the three categories that ‘Qualitative’ data can be

A

Binomial/Binary - Two mutually exclusive groups, e.g. Yes/No, Pass/Fail, True/False
Nominal - Multiple groups with no distinct ordering (such as regions, hair colour, blood groups)
Ordinal - Similar to nominal but with an intrinsic order (e.g. satisfaction levels, salary bands)

6
Q

State the two categories that ‘Quantitative’ data can be

A

Continuous - Decimal numbers that can be measured to any level of precision, e.g. heights, speeds, distances, time
Discrete - Normally whole numbers such as counts, ranks, indexes

7
Q

Define ‘Open’ data

A

Available in a machine-readable format without restrictions over the ability to use, consume, or share the information.

8
Q

Define ‘public’ data

A

Available to the public to collect or look at, but it’s not easily redistributed (or machine readable) and sometimes not easily obtained.

9
Q

Define ‘proprietary’ data

A

Data whose ownership is claimed by a specific entity or company. It may be protected under copyright, patent, or trade secret laws.

10
Q

Define ‘Operational’ data

A

Data used in day-to-day business operations. Examples: information on direct competitors, information on suppliers, accounting data and projections of needed resources.

11
Q

Define ‘Administrative’ data

A

Data collected to produce management information. It is used to guide future actions but is not strictly necessary for the immediate operation of a business.

12
Q

What does ‘RTF’ stand for, when referring to file types

A

Rich text format

13
Q

What does ‘XML’ stand for, when referring to file types

A

Extensible Markup Language

14
Q

What does ‘JSON’ stand for, when referring to file types

A

JavaScript Object Notation

15
Q

Describe the structure of a JSON file

A

Uses key:value pairs, organised into lists (arrays), records (objects), and sub-records (nested objects).
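
As a rough illustration of that structure, the sketch below parses a small JSON string with Python's standard `json` module. The record itself (names, scores, city) is made up for this example.

```python
import json

# Hypothetical record, invented purely for illustration.
text = '{"name": "Ada", "scores": [90, 85], "address": {"city": "Leeds"}}'

record = json.loads(text)            # the whole document is one record (object)
first_score = record["scores"][0]    # "scores" holds a list
city = record["address"]["city"]     # "address" holds a sub-record
```

Each key maps to a value, and a value can itself be a list or another record, which is how nesting arises.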

16
Q

State and describe the three ‘V’s when referring to Big Data

A

High Volume - can be Peta/Exabytes and distributed across multiple computers/servers
Large Variety - Contains many different data types
Fast Velocity - Data ingestion and data creation is rapid

17
Q

State the five stages of the data life cycle in order

A

1) Creation
2) Initial Storage
3) Archiving
4) Obsolete
5) Deleted

18
Q

Describe the ‘Creation’ stage in the data life cycle

A

The data is created, either by measurement or collection.

19
Q

Describe the ‘Initial storage’ stage in the data life cycle

A

The data is organised and stored to make analysis easier

20
Q

Describe the ‘Archiving’ stage in the data life cycle

A

The data is archived with summary data to help future research

21
Q

Describe the ‘Obsolete’ stage in the data life cycle

A

The data becomes obsolete, often because systems or processes have been updated or because the data itself has been superseded

22
Q

Describe the ‘Deleted’ stage in the data life cycle

A

All copies of the data are removed

23
Q

What type of graph am I describing?

“Non-Technical, good for showing non-consecutive data points and excellent for showing relationships and trends between variables.”

A

Scatter chart

24
Q

What type of graph am I describing?

“Technical: good for showing more details and features beyond what a scatter chart can achieve.”

A

Bubble chart

25
Q

What type of graph am I describing?

“Non-technical, shows consecutive changes in a variable as well as the starting and ending value”

A

Waterfall chart

26
Q

What type of graph am I describing?

“Technical, best for showing scores or ranks for multiple features/axes.”

A

Radar/Spider chart

27
Q

What type of graph am I describing?

“Technical, very good for showing lots of data as surface. Shows X (left right), Y (up down) and Z values (colour of pixel).”

A

Heat map

28
Q

What type of graph am I describing?

“Technical, best for showing project progression and planning different stages of a process.”

A

Gantt Chart

29
Q

What type of graph am I describing?

“Non-technical, good for showing opinions and sentiments (positive and negative feedback). An excellent graph type for showing qualitative information in reports.”

A

Word cloud

30
Q

State and describe the three types of time-series

A

Time-series - A variable that changes in time. Often measured at regular intervals and can show trends, seasonality and random fluctuations

Stationary time-series - A time series that fluctuates but stays around a set mean value, e.g. daily temperature, heart rates, blood pressure, call frequencies

Non-stationary time-series - A time series that contains a trend and moves away from a mean value, e.g. stock prices over 30 years, air passenger numbers over 40 years.
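
One informal way to see the difference is to compare the mean of the first half of a series with the mean of the second half. The sketch below does this for two made-up series. It is only an illustration, not a formal stationarity test, and the helper name `half_means` is hypothetical.

```python
from statistics import mean

def half_means(series):
    """Return the mean of the first half and of the second half of a series."""
    mid = len(series) // 2
    return mean(series[:mid]), mean(series[mid:])

# Invented values: one series hovers around a fixed level, the other trends upwards.
stationary = [10, 12, 9, 11, 10, 12, 9, 11]
trending = [1, 2, 3, 4, 5, 6, 7, 8]

stationary_halves = half_means(stationary)  # both halves have a similar mean
trending_halves = half_means(trending)      # the second half has a clearly higher mean
```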

31
Q

State and describe the two ‘types’ of errors

A

Type 1 - False positive. This occurs when you have a positive result that turns out to be wrong, e.g. a fire alarm goes off but it was only triggered by air pollution

Type 2 - False negative. This occurs when you have a negative result that turns out to be wrong, e.g. there is a fire and the fire alarm doesn’t go off
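
Continuing the fire alarm example, the sketch below counts both kinds of error from paired observations. The data and the function name `count_errors` are invented for illustration.

```python
def count_errors(actual, predicted):
    """Count Type 1 (false positive) and Type 2 (false negative) errors."""
    pairs = list(zip(actual, predicted))
    type_1 = sum(1 for a, p in pairs if p and not a)  # alarm sounded, no fire
    type_2 = sum(1 for a, p in pairs if a and not p)  # fire, but no alarm
    return type_1, type_2

# Hypothetical observations: was there a fire, and did the alarm go off?
fire = [True, False, False, True]
alarm = [True, True, False, False]

type_1, type_2 = count_errors(fire, alarm)
```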

32
Q

State the two meanings for data aggregation

A

1) On a table, this refers to when you calculate summary statistics from a larger set of numbers
2) The process of accumulating data from different sources.
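
The first meaning can be sketched in a few lines: rows from a larger table are grouped and reduced to summary statistics. The sales figures and region names below are made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical table of (region, sale amount) rows.
sales = [("North", 100), ("South", 80), ("North", 140), ("South", 120)]

# Group the amounts by region.
groups = defaultdict(list)
for region, amount in sales:
    groups[region].append(amount)

# Reduce each group to summary statistics.
summary = {
    region: {"count": len(amounts), "total": sum(amounts), "mean": mean(amounts)}
    for region, amounts in groups.items()
}
```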

33
Q

Define ‘data blending’

A

The process of combining multiple tables together into a single larger file
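
A minimal sketch of blending, assuming the two tables share a key column. Here the tables are plain lists of dictionaries and the shared key is a hypothetical `id` field; real tools would usually do this with a join or merge operation.

```python
# Two hypothetical tables that share an "id" column.
customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "total": 30}, {"id": 2, "total": 45}]

# Index one table by the shared key, then merge matching rows.
orders_by_id = {row["id"]: row for row in orders}
blended = [
    {**customer, **orders_by_id[customer["id"]]}
    for customer in customers
    if customer["id"] in orders_by_id
]
```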

34
Q

Define ‘data masking’

A

Anonymising or redacting data before analysis begins
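
As one possible illustration, the sketch below replaces a name column with a truncated hash before analysis. The choice of hashing is an assumption made for this example. It is closer to pseudonymisation than full anonymisation, and the helper name `mask` and the sample row are hypothetical.

```python
import hashlib

def mask(value):
    """Replace a value with a short, non-readable token (assumed approach)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

# Hypothetical row containing an identifying field.
rows = [{"name": "Ada Smith", "age": 36}]

# Mask the identifying column and leave the analytical column untouched.
masked = [{**row, "name": mask(row["name"])} for row in rows]
```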

35
Q

State and describe the three stages of ETL

A

1) Extract - extract the data from a larger source. This data may be unprocessed raw data and it may also be unstructured
2) Transform - Change the format and structure of the data so that it can be stored in a manageable database.
3) Load - load the data to a given destination
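
The three stages can be sketched end to end with the standard library. In this example the source is a small CSV string and the destination is an in-memory SQLite database. The data, table name and function names are all invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data.
RAW = "name,height_cm\nAda,170\nBob,182\n"

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert types and units (cm as text to metres as float)."""
    return [(row["name"], int(row["height_cm"]) / 100) for row in rows]

def load(rows, conn):
    """Load: write the cleaned rows to the destination table."""
    conn.execute("CREATE TABLE people (name TEXT, height_m REAL)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
result = conn.execute("SELECT name, height_m FROM people ORDER BY name").fetchall()
```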

36
Q

Describe ‘sampling error’

A

When more of one measurement is made than another, simply because of random chance

37
Q

Describe ‘non-sampling error’

A

Occurs when you introduce a consistent error in your measurements. This error cannot be reduced by taking more samples of the same type.
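
A small simulation can make the contrast between sampling and non-sampling error concrete. The sketch below assumes a true value of 50, random noise with a standard deviation of 5, and a consistent measurement bias of 2. All of these numbers are made up for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_VALUE = 50.0
BIAS = 2.0  # hypothetical consistent error, e.g. a miscalibrated instrument

def measure(n, bias=0.0):
    """Average n noisy measurements, optionally with a consistent bias."""
    return mean(random.gauss(TRUE_VALUE, 5) + bias for _ in range(n))

small_sample = measure(10)            # sampling error can be noticeable
large_sample = measure(10_000)        # sampling error shrinks with more samples
biased_sample = measure(10_000, BIAS) # the bias remains however many samples are taken
```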

38
Q

Describe ‘data reconciliation’

A

When you combine more samples together in an effort to reduce sampling error

39
Q

Name the two types that personal information can be split into

A

Personal - Any record personal to yourself (e.g. therapist records)

Identifying - Information that can be used to identify you from others in a dataset

40
Q

Name the three basic principles in the Data Protection Act (1998)

A

Incorrect information must be corrected
Data cannot be used to cause harm or distress
Not used for direct marketing

41
Q

Name the eight principles of what data must be as part of the DPA regulation

A

1) Fairly and lawfully processed
2) Processed for limited purposes
3) Adequate, relevant and not excessive
4) Accurate
5) Not kept for longer than is necessary
6) Processed in line with your rights
7) Personal data must be kept securely
8) Not transferred to other countries without adequate protection

42
Q

Name the 8 main principles as part of the new General Data Protection Regulation (GDPR)

A

1) The right to be informed
2) The right to access
3) The right to rectification
4) The right to erasure
5) The right to restrict processing
6) The right to data portability
7) The right to object
8) Rights related to automated decision making and profiling

43
Q

State and describe the six common general data structures on a computer

A

1) Directories
2) Lists
3) Arrays
4) Records
5) Trees
6) Tables

44
Q

Describe ‘lists’, when referring to general data structures

A

Contains different data types in one structure.

Less structured and slower to query

45
Q

Describe ‘Arrays’, when referring to general data structures

A

Sequence of items, that must contain the same data type
Can be multi-dimensional
Good for mathematical operations and fast querying
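
The contrast between lists and arrays can be seen in Python, where a built-in `list` accepts mixed types while an `array` from the standard `array` module enforces a single type. This is only one language's take on the distinction, and the values are made up.

```python
from array import array

# A list can hold different data types in one structure.
mixed = [1, "two", 3.0]

# An array is declared with a single type code ("d" means double-precision float).
numbers = array("d", [1.0, 2.0, 3.0])

# Adding a value of the wrong type is rejected.
try:
    numbers.append("four")
    rejected = False
except TypeError:
    rejected = True

total = sum(numbers)
```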

46
Q

Describe ‘Records’, when referring to general data structures

A

Two types:

1) In a database, a record is a row in a table
2) In an object-based record, this is an object that contains numerous fields/attributes. These records can contain data, other records or functions
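
The second, object-based meaning might look like the sketch below, where a record holds data, another record and a function. The class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Address:
    city: str

@dataclass
class Person:
    name: str          # a data field
    address: Address   # another record nested inside this one

    def label(self):
        """A function attached to the record."""
        return f"{self.name} ({self.address.city})"

person = Person("Ada", Address("Leeds"))
label = person.label()
```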

47
Q

Describe ‘Trees’, when referring to general data structures

A

A linked structure of nodes, similar to a linked list
Each node can have daughter (child) nodes
Normally used to show hierarchical data
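
A minimal sketch of a tree, assuming each node stores a name and a list of child nodes. The organisation names are invented, and `walk` is a hypothetical helper that visits the nodes depth first.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)

def walk(node, depth=0):
    """Yield (depth, name) for a node and then for all of its descendants."""
    yield depth, node.name
    for child in node.children:
        yield from walk(child, depth + 1)

# Hypothetical hierarchy: a company with two departments, one with a sub-unit.
root = Node("Company", [Node("Sales", [Node("UK")]), Node("IT")])
visited = list(walk(root))
```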

48
Q

Describe ‘Tables’, when referring to general data structures

A

Common store of data - contains rows (records) and columns (fields)
Each field has a single data type.
All fields and records have the same length

49
Q

State the three types of data models

A

1) Conceptual data model
2) Logical data model
3) Physical data model

50
Q

Describe a ‘Conceptual’ data model

A

Is a basic plan of a database
Contains the tables and links (often with no cardinality)
Used to communicate the basic scheme of a database to non-technical people or to plan

51
Q

Describe a ‘Logical’ data model

A

Similar to a conceptual data model, but with more detail about the format, primary keys and foreign keys
Highlights database names and relationships with other databases

52
Q

Describe a ‘Physical’ data model

A

Includes more detail than ‘Logical’ (such as data types, database triggers, stored procedures, access constraints/permissions)
Uses specific names for tables and fields
Contains all the information needed to start implementing a database
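
To give a feel for the level of detail in a physical model, the sketch below defines two tables in SQLite with specific names, data types and a foreign key. The schema is hypothetical and far smaller than a real physical model would be.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Hypothetical schema: specific table names, field names, data types and keys.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total       REAL NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1, 30.0)")

# An order that points at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (11, 99, 5.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```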