Term Glossary (Topic 2.1) Flashcards
State the six common data types supported in R
1) Lists
2) Vectors
3) Arrays
4) Trees
5) Data Frames
6) Collections
Define ‘Lists’ when referring to common data types in R
Groups of data that can have different formats.
Define ‘Vectors’ when referring to common data types in R
Groups of data with the same data type.
Define ‘Arrays’ when referring to common data types in R
Like Vectors but can span multiple dimensions.
Define ‘Trees’ when referring to common data types in R
Hierarchical groups of data.
Define ‘Data Frames’ when referring to common data types in R
Tables of data (similar to classic database tables).
Define ‘Collections’ when referring to common data types in R
Used to store key: value pairs.
State the four main data handling methods
1) Searching and sorting
2) Grouping
3) Filtering
4) Modelling
Define ‘Searching and Sorting’ when referring to the main data handling methods
Data sorting involves reordering the data according to given column(s). Searching involves identifying a value of interest. This value may be a primary key that matches with a foreign key from another table.
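As a rough illustration, sorting and searching might be sketched in Python as follows. The table contents and the column names (`order_id`, `customer_id`, `total`) are made up for this example.

```python
# Hypothetical rows; the column names are assumptions for illustration.
orders = [
    {"order_id": 3, "customer_id": 11, "total": 40.0},
    {"order_id": 1, "customer_id": 12, "total": 15.5},
    {"order_id": 2, "customer_id": 11, "total": 99.0},
]
# A second hypothetical table, keyed by its primary key (customer id).
customers = {11: "Asha", 12: "Ben"}

# Sorting: reorder the rows by one column, then by two columns.
by_total = sorted(orders, key=lambda row: row["total"])
by_customer_then_total = sorted(
    orders, key=lambda row: (row["customer_id"], row["total"])
)

# Searching: find the row whose key matches a value of interest.
def find_order(rows, order_id):
    for row in rows:
        if row["order_id"] == order_id:
            return row
    return None  # no matching row

match = find_order(orders, 2)
# Follow the foreign key (customer_id) to the matching primary key.
customer_name = customers[match["customer_id"]]
```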
Define ‘Grouping’ when referring to the main data handling methods
This is another term for data aggregation, in which numerous values are categorised together according to a common value from other column(s). These values are then summarised into a single value such as a count, average, min or max.
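A minimal Python sketch of grouping, assuming a made-up list of (region, amount) sales records: values that share a region are collected together and then summarised.

```python
from collections import defaultdict

# Hypothetical (region, amount) records for illustration.
sales = [("north", 10), ("south", 4), ("north", 6), ("south", 8), ("east", 5)]

# Step 1: categorise the amounts by their common value (the region).
groups = defaultdict(list)
for region, amount in sales:
    groups[region].append(amount)

# Step 2: summarise each group into single values.
summary = {
    region: {
        "count": len(amounts),
        "average": sum(amounts) / len(amounts),
        "min": min(amounts),
        "max": max(amounts),
    }
    for region, amounts in groups.items()
}
```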
Define ‘Filtering’ when referring to the main data handling methods
The process of identifying a subset of data that satisfies a given condition.
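For example, filtering could look like this in Python. The sensor readings and the 20 degree threshold are assumptions chosen only to show the idea.

```python
# Hypothetical readings for illustration.
readings = [
    {"sensor": "a", "temp": 18.5},
    {"sensor": "b", "temp": 25.1},
    {"sensor": "c", "temp": 30.2},
]

# Keep only the subset of rows that satisfies the condition temp > 20.
warm = [row for row in readings if row["temp"] > 20]
```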
Define ‘Modelling’ when referring to the main data handling methods
Placing your data into a database model that best enables analysis. This may involve denormalization (joining relevant data together prior to analysis).
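One way to picture denormalization is joining two small tables into one flat table before analysis. This Python sketch uses invented customer and order data; a real project would more likely do this in the database or with a data-frame library.

```python
# Hypothetical normalised tables for illustration.
customers = {
    1: {"name": "Asha", "city": "Leeds"},
    2: {"name": "Ben", "city": "York"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "total": 20.0},
    {"order_id": 101, "customer_id": 2, "total": 35.0},
    {"order_id": 102, "customer_id": 1, "total": 12.5},
]

# Denormalise: copy each customer's details onto every one of their orders,
# so that each row holds everything needed for analysis.
denormalised = [
    {**order, **customers[order["customer_id"]]} for order in orders
]
```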
State six different ways in which errors can occur within datasets
1) Missing data
2) Inconsistent data
3) Redundant data
4) Invalid data
5) Data out of range
6) Outliers
Describe ‘Missing data’ when referring to the ways errors can occur in datasets
Values or records that are absent, such as nulls or missing rows (gives incomplete data).
Describe ‘Inconsistent data’ when referring to the ways errors can occur in datasets
Equivalent data stored in different locations does not match, or is stored in different formats.
Describe ‘Redundant data’ when referring to the ways errors can occur in datasets
Data that does not tell us anything we could not gather from other sources, such as knowing someone’s age in years and their age in months.
Describe ‘Invalid data’ when referring to the ways errors can occur in datasets
Data that is of the wrong format, structure or otherwise cannot be used for calculation.
Describe ‘Data out of range’ when referring to the ways errors can occur in datasets
Values that are outside of an expected range such that they could not have possibly occurred. An example could be having a count below 0 or setting someone’s age as 150 years.
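A range check can be sketched in Python as below. The limits of 0 and 120 years are an assumption for this example, not a rule from the source.

```python
# Hypothetical ages; the valid range is an assumption for illustration.
ages = [34, 150, 27, -3, 61]
MIN_AGE, MAX_AGE = 0, 120

# Flag every value that falls outside the expected range.
out_of_range = [age for age in ages if not MIN_AGE <= age <= MAX_AGE]
in_range = [age for age in ages if MIN_AGE <= age <= MAX_AGE]
```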
Describe ‘Outliers’ when referring to the ways errors can occur in datasets
Similar to data out of range: a value that is so far above or below the rest of the dataset that it is reasonable to exclude it from further analysis. Unlike data that is out of range, outliers can occur naturally, but they are so rare that they can be discounted from analysis.
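The definition above does not say how far is "too far". One common convention, used here purely as an assumption, is the 1.5 × IQR rule: anything more than 1.5 interquartile ranges below the first quartile or above the third quartile is flagged. A Python sketch with made-up values:

```python
from statistics import quantiles

# Hypothetical measurements; 95 is far from the rest.
values = [10, 12, 11, 13, 12, 11, 10, 95]

# Quartiles of the data (quantiles with n=4 returns Q1, median, Q3).
q1, _median, q3 = quantiles(values, n=4)
iqr = q3 - q1

# The 1.5 * IQR fences are a convention, not part of the source definition.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
kept = [v for v in values if lower_fence <= v <= upper_fence]
```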
Define ‘Bias’ when referring to functions used in analysis
When measuring or recording something, bias is the average amount by which the measurements differ from the true value. For example, if you measured 1000 people’s heights but forgot to ask them to take their shoes off, then you will have measured their heights with an added bias. The bias would be the average thickness of the soles of people’s shoes.
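The shoe example can be sketched in Python, assuming we somehow know the true heights (in practice the true values are usually unknown, which is what makes bias hard to spot). All numbers are invented.

```python
from statistics import mean

# Hypothetical heights in cm: true values and values measured with shoes on.
true_heights = [170.0, 165.0, 180.0, 175.0]
measured = [172.5, 167.0, 183.0, 177.5]

# Bias is the average difference between measured and true values.
errors = [m - t for m, t in zip(measured, true_heights)]
bias = mean(errors)

# If the bias is known, it can be subtracted from each measurement.
corrected = [m - bias for m in measured]
```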
Define ‘Linear Regression’ when referring to functions used in analysis
A set of mathematical models that assume there is a linear relationship between different variables. Linear regression can be used to predict values, analyse the strength of relationship between variables and assess the statistical significance of these relationships.
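As a small illustration, the simplest member of this family (one predictor, fitted by ordinary least squares) can be written out by hand in Python. The data points are invented, and a real analysis would normally use a statistics library that also reports significance.

```python
from statistics import mean

# Hypothetical paired observations for illustration.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

x_bar, y_bar = mean(xs), mean(ys)

# Least-squares estimates for the line y = intercept + slope * x.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

def predict(x):
    """Predict y for a new x using the fitted line."""
    return intercept + slope * x

prediction = predict(6)
```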
Define ‘Correlation’ when referring to functions used in analysis
More accurately called the correlation coefficient, this is a score that quantifies the strength of the relationship between two variables. It ranges from -1 (perfectly anti-correlated) to 1 (perfectly correlated). A score of 0 represents no linear correlation between the two variables.
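The score described here is usually the Pearson correlation coefficient; that choice is an assumption, since the card does not name a specific coefficient. A hand-written Python sketch with toy data:

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rising = pearson([1, 2, 3], [2, 4, 6])          # y rises with x
falling = pearson([1, 2, 3], [6, 4, 2])         # y falls as x rises
u_shaped = pearson([1, 2, 3, 4], [1, -1, -1, 1])  # related, but not linearly
```

The last case gives a score of 0 even though the values clearly follow a pattern, which is why a score of 0 only rules out a linear relationship.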
Define ‘Stem and Leaf Plots’ when referring to functions used in analysis
A way to display numbers and counts, used to show how measurements are distributed. The final digit of each number (the leaf) is shown on the right and the leading tens/hundreds values (the stem) are shown on the left.
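A minimal Python sketch of building such a plot, assuming non-negative whole numbers and using the final digit as the leaf. Stems with no values are simply skipped in this version.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem and leaf plot for non-negative integers."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 38 -> stem 3, leaf 8
        stems[stem].append(leaf)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]

# Hypothetical measurements for illustration.
lines = stem_and_leaf([23, 25, 31, 38, 38, 42, 47, 51])
for line in lines:
    print(line)
```

Each printed row shows one stem on the left of the bar and its leaves on the right, so longer rows mark the ranges where most measurements fall.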
Explain when you would use a ‘NoSQL’ database
When data is unstructured or semi-structured and does not fit the fixed tables of a relational database.