Term Glossary (Topic 2.1) Flashcards by Samuel Bull

State the six common data types supported in R

1) Lists
2) Vectors
3) Arrays
4) Trees
5) Data Frames
6) Collections

Define ‘Lists’ when referring to common data types in R

Ordered groups of data whose elements can have different data types.

Define ‘Vectors’ when referring to common data types in R

Groups of data with the same data type.

Define ‘Arrays’ when referring to common data types in R

Like Vectors but can span multiple dimensions.

Define ‘Trees’ when referring to common data types in R

Hierarchical groups of data.

Define ‘Data Frames’ when referring to common data types in R

Tables of data (similar to classic database tables).

Define ‘Collections’ when referring to common data types in R

Used to store Key : Value pairs.

State the four main data handling methods

1) Searching and sorting
2) Grouping
3) Filtering
4) Modelling

Define ‘Searching and Sorting’ when referring to the main data handling methods

Data sorting involves reordering the data according to given column(s). Searching involves identifying a value of interest. This value may be a primary key that matches with a foreign key from another table.
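
As a minimal sketch in Python, sorting and a key-based search might look like the following. The table contents, column names and the `find_by_key` helper are invented for illustration and are not part of the course material.

```python
# Hypothetical customer table; names and values are illustrative only.
customers = [
    {"customer_id": 3, "name": "Cara", "age": 41},
    {"customer_id": 1, "name": "Abe", "age": 29},
    {"customer_id": 2, "name": "Bo", "age": 35},
]

# Sorting: reorder the rows according to a given column.
by_age = sorted(customers, key=lambda row: row["age"])


def find_by_key(rows, key_column, value):
    """Searching: return the first row whose key column equals value, else None."""
    for row in rows:
        if row[key_column] == value:
            return row
    return None


# An order row holds customer_id as a foreign key into the customer table.
order = {"order_id": 10, "customer_id": 2}
match = find_by_key(customers, "customer_id", order["customer_id"])
```

Here `by_age` lists the customers from youngest to oldest, and `match` is the customer row whose primary key equals the order's foreign key.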

Define ‘Grouping’ when referring to the main data handling methods

This is another term for data aggregation, in which numerous values are categorised together according to a common value from other column(s). These values are then summarised into a single value such as a count, average, minimum or maximum.
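
A small Python sketch of grouping, assuming a made-up list of (region, amount) sales rows: the amounts are first collected by region and then each group is summarised.

```python
from collections import defaultdict

# Hypothetical sales rows: (region, amount). Values are illustrative only.
sales = [("North", 10), ("South", 5), ("North", 30), ("South", 15), ("East", 8)]

# Categorise the amounts together by their common region value.
groups = defaultdict(list)
for region, amount in sales:
    groups[region].append(amount)

# Summarise each group into single values.
summary = {
    region: {
        "count": len(amounts),
        "average": sum(amounts) / len(amounts),
        "min": min(amounts),
        "max": max(amounts),
    }
    for region, amounts in groups.items()
}
```

For the "North" group this gives a count of 2, an average of 20.0, a minimum of 10 and a maximum of 30.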

Define ‘Filtering’ when referring to the main data handling methods

The process of identifying a subset of data that satisfies a given condition.
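
Filtering can be sketched in Python as keeping only the values for which a condition holds. The readings and the `filter_rows` helper below are hypothetical.

```python
# Hypothetical sensor readings; values are illustrative only.
readings = [12, 47, 8, 33, 51, 19]


def filter_rows(values, condition):
    """Return the subset of values that satisfies the given condition."""
    return [value for value in values if condition(value)]


# Keep only readings above 30.
above_30 = filter_rows(readings, lambda value: value > 30)
```

The original order is preserved, so `above_30` holds 47, 33 and 51.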

Define ‘Modelling’ when referring to the main data handling methods

Placing your data into a database model that best enables analysis. This may involve denormalization (joining relevant data together prior to analysis).
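
One way to picture denormalization is joining a customer table onto an order table before analysis. The sketch below uses invented tables and column names and is only one possible approach.

```python
# Hypothetical normalised tables; contents are illustrative only.
customers = {
    1: {"name": "Abe", "city": "Leeds"},
    2: {"name": "Bo", "city": "York"},
}
orders = [
    {"order_id": 10, "customer_id": 2, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 40.0},
    {"order_id": 12, "customer_id": 2, "total": 15.0},
]

# Denormalise: copy the matching customer columns onto every order row,
# so later analysis does not need to look anything up.
denormalised = [{**order, **customers[order["customer_id"]]} for order in orders]
```

Each row of `denormalised` now carries the order columns together with the customer's name and city.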

State six different ways in which errors can occur within datasets

1) Missing data
2) Inconsistent data
3) Redundant data
4) Invalid data
5) Data out of range
6) Outliers

Describe ‘missing data’ when referring to the ways errors can occur in datasets

Values that are absent, such as nulls or missing rows, which leave the data incomplete.

Describe ‘Inconsistent data’ when referring to the ways errors can occur in datasets

Equivalent data stored in different locations does not match or is stored in different formats.

Describe ‘Redundant data’ when referring to the ways errors can occur in datasets

Data that does not tell us anything we could not gather from other sources, such as knowing someone’s age in both years and months.

Describe ‘Invalid data’ when referring to the ways errors can occur in datasets

Data that is of the wrong format, structure or otherwise cannot be used for calculation.

Describe ‘Data out of range’ when referring to the ways errors can occur in datasets

Values that are outside of an expected range such that they could not have possibly occurred. An example could be having a count below 0 or setting someone’s age as 150 years.
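
The missing, invalid and out-of-range cases can be told apart with a simple check. This is a rough Python sketch: the `check_age` helper and the 0 to 120 range are assumptions chosen for illustration.

```python
def check_age(value, low=0, high=120):
    """Classify one age value; the 0-120 range is an assumed limit."""
    if value is None:
        return "missing"
    if not isinstance(value, (int, float)):
        return "invalid"
    if value < low or value > high:
        return "out of range"
    return "ok"


# Hypothetical column of ages containing several kinds of error.
ages = [34, None, "thirty", -2, 150, 61]
labels = [check_age(age) for age in ages]
```

With these inputs, `None` is labelled missing, the text value invalid, and both -2 and 150 out of range.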

Describe ‘outliers’ when referring to the ways errors can occur in datasets

Similar to data out of range. This occurs when a value is so far above or below the values of the rest of the dataset that it is reasonable to exclude it from further analysis. In contrast to data that is out of range, outliers can occur naturally but they are so rare that they can be discounted from analysis.
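
The card does not say how far is "so far"; one common convention, assumed here, flags values more than 1.5 times the interquartile range beyond the quartiles. The `find_outliers` helper and the data are hypothetical.

```python
import statistics


def find_outliers(values, k=1.5):
    """Flag values beyond k * IQR from the quartiles (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in values if value < low or value > high]


# Hypothetical measurements with one extreme value.
outliers = find_outliers([10, 12, 11, 13, 12, 11, 95])
```

Only the value 95 falls outside the fences for this data. Other rules (for example, a number of standard deviations from the mean) would be equally valid choices.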

Define ‘Bias’ when referring to functions used in analysis

When measuring or recording something, bias is the average amount by which each measurement differs from the true theoretical value. For example, if you measured 1,000 people’s heights but forgot to ask them to take their shoes off, then you will have measured their heights with an added bias. The bias would be the average thickness of the soles of people’s shoes.
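
The shoe example can be worked through in Python. The heights below are made up, and in practice the true values are rarely known exactly.

```python
# Hypothetical true heights (cm) and heights measured with shoes on.
true_heights = [170.0, 165.0, 180.0, 175.0]
measured = [172.5, 167.0, 183.0, 177.5]

# Bias is the average difference between measurement and true value.
differences = [m - t for m, t in zip(measured, true_heights)]
bias = sum(differences) / len(differences)
```

Here the differences are 2.5, 2.0, 3.0 and 2.5 cm, so the bias is 2.5 cm, the average sole thickness in this invented sample.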

Define ‘Linear Regression’ when referring to functions used in analysis

A set of mathematical models that assume there is a linear relationship between different variables. Linear regression can be used to predict values, analyse the strength of relationship between variables and assess the statistical significance of these relationships.
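
As an illustration of the prediction use, here is a minimal least-squares fit of a straight line to one input variable, written by hand in Python. The `fit_line` helper and the data are hypothetical; real analyses would normally use a statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data that lies exactly on the line y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])

# Use the fitted line to predict a value for a new x.
predicted = slope * 6 + intercept
```

For this data the fitted slope is 2 and the intercept is 1, so the prediction at x = 6 is 13.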

Define ‘Correlation’ when referring to functions used in analysis

Correlation (more accurately, the correlation coefficient) is a score that quantifies the strength of the linear relationship between two variables. It ranges from -1 (perfectly anti-correlated) to 1 (perfectly correlated). A score of 0 represents no linear correlation between the two variables.
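
The coefficient described here is usually the Pearson correlation coefficient (an assumption, since the card does not name it). A hand-written Python sketch with invented data:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither variable is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
rising = pearson(xs, [2, 4, 6, 8, 10])    # y increases with x
falling = pearson(xs, [10, 8, 6, 4, 2])   # y decreases as x increases
v_shaped = pearson(xs, [2, 1, 0, 1, 2])   # related, but not linearly
```

The rising data scores 1, the falling data scores -1, and the V-shaped data scores approximately 0 even though the two variables are clearly related, which is why a score of 0 only rules out a linear relationship.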

Define ‘Stem and Leaf Plots’ when referring to functions used in analysis

A way to display numbers and counts, used to show how measurements are distributed. The final digit of each number (the leaf) is shown on the right and the tens/hundreds values (the stem) are shown on the left.
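
A stem and leaf plot for whole numbers can be built by splitting each value into its tens part and its final digit. The `stem_and_leaf` helper and the data below are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative whole numbers."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # tens part on the left, last digit on the right
        stems[stem].append(leaf)
    rows = []
    for stem, leaves in sorted(stems.items()):
        leaf_text = " ".join(str(leaf) for leaf in leaves)
        rows.append(f"{stem} | {leaf_text}")
    return rows


rows = stem_and_leaf([23, 25, 31, 12, 18, 25, 39, 30])
```

Printing the rows one per line gives:

```
1 | 2 8
2 | 3 5 5
3 | 0 1 9
```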

Explain when you would use a ‘NoSQL’ database

When data is unstructured or semi-structured and so is better stored in a non-relational database than in fixed relational tables.

State the four common types of 'NoSQL' databases

1) Graph
2) Document
3) Column Family/Wide Column
4) Key : Value

Define 'Graph' when referring to types of 'NoSQL' databases

Stores data as nodes across a highly connected network. Nodes are connected with ‘weak links’ (edges) that describe the relationship one piece of data has with another.

Define 'Document' when referring to types of 'NoSQL' databases

Stores data in sets of easily sortable and searchable documents such as XML or JSON files.

Define 'Column Family/Wide Column' when referring to types of 'NoSQL' databases

Data is stored in a large table that can be split up into numerous sub-tables. Each sub-table may have its own key identifier.

Define 'Key:Value' when referring to types of 'NoSQL' databases

Stores data as a series of nested Key : Value stores. For example, there may be a main Key with an associated value, and that value may itself be another Key : Value pair.
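
The nesting can be pictured with Python dictionaries, used here only as a stand-in for a real Key : Value database. The keys, values and the `get_nested` helper are invented for illustration.

```python
# Hypothetical store: the value for "user:42" is itself a set of
# Key : Value pairs, one of which ("address") nests a further level.
store = {
    "user:42": {
        "name": "Alex",
        "address": {"city": "Leeds", "postcode": "LS1"},
    },
}


def get_nested(mapping, *keys):
    """Follow a chain of keys down through nested Key : Value stores."""
    current = mapping
    for key in keys:
        current = current[key]
    return current


city = get_nested(store, "user:42", "address", "city")
```

Looking up the main key and then each nested key in turn returns "Leeds".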

Define 'Probability' when referring to functions used in data analysis

This is a measure of the chance that an event will occur. Probabilities range from 0 to 1, where 0 denotes an event that has no chance of happening and 1 denotes an event that will always happen. Probabilities can also refer to multiple events occurring.
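
For multiple events, a short worked example in Python, assuming two rolls of a fair six-sided die that do not affect each other (independent events):

```python
from fractions import Fraction

# Probability of rolling a six on one fair die.
p_six = Fraction(1, 6)

# Both of two independent rolls are sixes: multiply the probabilities.
p_two_sixes = p_six * p_six

# At least one six in two rolls: 1 minus the chance of no six at all.
p_at_least_one_six = 1 - (1 - p_six) ** 2
```

This gives 1/36 for two sixes and 11/36 for at least one six; both stay within the 0 to 1 range.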

Define 'Statistical Significance' when referring to functions used in data analysis

This describes using statistical methods to quantify how likely it is that a result you have obtained could have arisen purely by chance (the situation described by the null hypothesis). We often measure this significance using the p-value from statistical tests.
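
As a hedged illustration, suppose a coin lands heads 9 times in 10 flips and the null hypothesis is that the coin is fair. The one-sided p-value is the chance of seeing 9 or more heads from a fair coin. The `p_value_at_least` helper and the 0.05 threshold are assumptions for this sketch.

```python
from math import comb


def p_value_at_least(heads, flips):
    """Chance of getting at least `heads` heads in `flips` fair coin flips."""
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips


p_value = p_value_at_least(9, 10)

# 0.05 is a commonly used threshold, assumed here for illustration.
significant = p_value < 0.05
```

There are 11 favourable outcomes out of 1,024, so the p-value is about 0.011, which is below the assumed threshold.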

Define 'Scatter Plots' when referring to functions used in data analysis

Scatter plots show values from one variable plotted against values of another. Scatter plots are useful for showing the correlation/relationships between variables.

Define 'Factorial Numbers' when referring to functions used in data analysis

Factorial numbers are used to calculate the number of combinations of a set of numbers/items. A basic factorial number is the product of all the positive whole numbers up to and including it. They are represented in mathematics with an exclamation mark (n!).
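
A short Python sketch of the definition, with a hand-written `factorial` (hypothetical helper) checked against the standard library's `math.factorial`:

```python
import math


def factorial(n):
    """Product of all whole numbers from 1 up to and including n (0! is 1)."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


# 5! = 5 x 4 x 3 x 2 x 1: the number of ways to order 5 items.
orderings = factorial(5)

# Ways to choose 2 items from 5 when order does not matter: 5! / (2! x 3!).
pairs = factorial(5) // (factorial(2) * factorial(3))

same_as_stdlib = factorial(5) == math.factorial(5)
```

Here 5! is 120, and there are 10 ways to choose 2 items from 5.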

Define 'Box & Whisker Plots' when referring to functions used in data analysis

These plots show the spread of several values using a box (with limits being the 1st and 3rd quartile) and whiskers showing the minimum and maximum of the range of values. Commonly, the whiskers show the maximum and minimum values excluding outlier values.
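
The values a box and whisker plot draws can be computed directly. This sketch uses `statistics.quantiles` from the Python standard library with its default quartile method; other tools use slightly different quartile conventions, and the `five_number_summary` helper and data are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Minimum, quartiles and maximum: the values a box plot is drawn from."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),    # end of the lower whisker
        "q1": q1,              # lower edge of the box
        "median": median,      # line inside the box
        "q3": q3,              # upper edge of the box
        "max": max(values),    # end of the upper whisker
    }


summary = five_number_summary([2, 4, 4, 5, 6, 7, 9])
```

For this data the box spans 4 to 7 with the median at 5, and the whiskers reach 2 and 9.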

State three ways in which results can be presented

1) Tables
2) Charts
3) Graphs

State ten challenges that can occur when integrating large datasets

1) Implementation of MDM (Master Data Management)
2) Organisation, policy, rules & requirements
3) Redesigning data models
4) Limitation of software/hardware
5) Data ownership issues
6) Security
7) Effects on performance
8) High volume
9) High variety
10) Increased support costs

What package is required to create dataframes in Python?

pandas

Define a 'Tuple'. In which programming language would you find it?

A simple, fixed (immutable) group of items similar to a list, often used by functions to return multiple results. Found in Python.
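
A brief Python sketch of a function returning several results as a tuple; the `min_and_max` helper is invented for illustration.

```python
def min_and_max(values):
    """Return two results at once, packed together as a tuple."""
    return min(values), max(values)


result = min_and_max([4, 9, 1])

# A tuple can be unpacked into separate names.
lowest, highest = result

# Unlike a list, a tuple cannot be changed after it is created.
try:
    result[0] = 0
    changed = True
except TypeError:
    changed = False
```

Here `result` is the tuple `(1, 9)`, and the attempted assignment raises a `TypeError`, so `changed` ends up `False`.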

What three factors should be considered for 'Performance Stress Testing'?

1) The time taken to run the test
2) Effect on memory
3) Effect on processors
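
A rough Python sketch of measuring these three factors for a single piece of code, using only the standard library. The `workload` function is a hypothetical stand-in for whatever is being tested, and a real stress test would repeat this under increasing load.

```python
import time
import tracemalloc


def workload(n):
    """Stand-in for the code under test: build and sum a list of squares."""
    return sum([i * i for i in range(n)])


tracemalloc.start()                 # begin tracking Python memory allocations
wall_start = time.perf_counter()    # wall-clock timer
cpu_start = time.process_time()     # CPU time used by this process

result = workload(200_000)

elapsed_seconds = time.perf_counter() - wall_start   # 1) time taken
cpu_seconds = time.process_time() - cpu_start        # 3) processor time used
_, peak_bytes = tracemalloc.get_traced_memory()      # 2) peak memory allocated
tracemalloc.stop()
```

The measured numbers depend on the machine, so they are best compared between runs on the same hardware rather than read as absolute values.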

(39 cards)