Theory Questions Flashcards

1
Q

What is data cleansing and how do we practice it?

A
  • Data cleansing is a key part of the process of working with a large dataset.
  • Issues can arise because of human error or inconsistencies resulting from combining several different datasets.
  • It is essential to ensure you are working with accurate and consistent data before you analyse it; otherwise, you will arrive at inaccurate conclusions.

The data cleaning steps are:
- Remove duplicate data, which can skew your results
- Remove irrelevant data to improve the efficiency of your analysis
- Identify incomplete data and decide how to address it - removing incomplete values, or replacing them with suitable substitutes such as mean or median values
- Identify outliers which could distort analysis of the data, and decide whether to include or omit them - they can be efficiently identified using visualisations such as histograms or scatter plots
- Ensure the data is consistently formatted - making sure capitalisation is standard, and unwanted characters such as whitespace are removed
- Remove any errors - such as spelling mistakes, or incorrect formatting for measurements
- Ensure data is stored in the correct datatype (eg numbers stored as integers, words stored as strings) so the correct analysis can be performed
- Validate the data using a sample to ensure all the above processes have resulted in a consistent, accurate and complete dataset.
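
A minimal pandas sketch of several of these steps, using a small made-up DataFrame (the column names and values are illustrative only, not from any particular dataset):

```python
import pandas as pd

# Hypothetical messy data - in practice this would come from your source.
df = pd.DataFrame({
    "name": ["Alice", "alice ", "Bob", "Bob", None],
    "age": ["25", "25", "41", "41", "33"],
    "height_cm": [162.0, 162.0, 180.5, 180.5, None],
})

df["name"] = df["name"].str.strip().str.title()   # consistent formatting: trim whitespace, standardise capitalisation
df = df.drop_duplicates()                         # remove duplicate rows that could skew results
df["age"] = df["age"].astype(int)                 # store numbers in the correct datatype
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())  # replace missing values with the median
df = df.dropna(subset=["name"])                   # or remove rows where a key field is missing
print(df)                                         # validate the result on a sample
```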

2
Q

What is the difference between data profiling and data mining?

A
  • Both are important ways to understand the dataset you are working with and its metadata.
  • Data profiling is examining unprocessed data to understand its content and structure, identify patterns, and summarise important statistics.
  • You can assess important characteristics of the data such as quality, uniqueness, and consistency.
  • You can then identify and address abnormalities in the data.
  • Data mining is extracting insights and statistics from data using algorithms.
  • You can segment the data and identify trends and correlations.
  • This can help inform business decisions and predict future trends.
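
A quick profiling pass in pandas might look like this (the filename is a placeholder):

```python
import pandas as pd

df = pd.read_csv("customers.csv")   # placeholder filename

df.info()                      # structure: columns, datatypes, non-null counts
print(df.describe())           # summary statistics for numeric columns
print(df.nunique())            # uniqueness: distinct values per column
print(df.isna().sum())         # completeness: missing values per column
print(df.duplicated().sum())   # consistency: number of duplicate rows
```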
3
Q

Define Outlier with an example.

A
  • Outliers are unusual values in a dataset which are abnormally far away from the other data points. For example, in a dataset of household incomes, a single billionaire's income would be an outlier.
  • They can cause issues in data analysis because they can skew the results.
  • However, they can also sometimes give you useful insights about the data and highlight extreme cases.

There are different types of outliers:

  • Global outliers are single data points which are much larger or smaller than the rest of the dataset.
  • Contextual outliers are values which are significantly different from other data points in the same context.
  • Collective outliers are a subset of data points which differ from the rest of the dataset.

Outliers can arise for a number of different reasons, including:

  • Human error during data input
  • Sampling errors from extracting data from inaccurate sources
  • Data processing errors during data manipulation
  • Measurement errors
  • Natural outliers which occur in the dataset and are not the result of an error

You can deal with outliers in various ways, such as:

  • Excluding the outliers from the data sample, when they have occurred because of an error
  • Analysing the outliers separately, which allows you to investigate extreme cases
  • Using clustering methods to find the best approximate value to allocate to the outliers
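
As a sketch of one common detection rule, values beyond 1.5 times the interquartile range can be flagged in pandas (the numbers are made up; other rules, such as z-scores, are equally common):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 98])   # 98 looks like a global outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

print(s[is_outlier])        # inspect before deciding how to handle them
cleaned = s[~is_outlier]    # e.g. exclude them if they stem from an error
```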
4
Q

What is collaborative filtering?

A
  • Collaborative filtering is a technique used in recommender systems.
  • It allows you to make recommendations to users based on the behaviour and preferences of similar users.
  • It is a widely used technique which has been adopted by platforms including Amazon, YouTube and Netflix.
  • It works by searching a userbase to identify a subset of users with similar preferences to a particular user, then creating a list of suggestions based on the tastes of this subset.
  • It works on the premise that users who have displayed similar behaviour in the past are likely to do so again.
  • User tastes can be measured by analysing the ratings a user has given a piece of content (explicit feedback), or their interactions with that content, such as clicks, views, searches, and purchases (implicit feedback).
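
A toy sketch of the user-based approach, assuming a small made-up ratings matrix and cosine similarity (real systems use far larger matrices and more refined weighting):

```python
import numpy as np

# Hypothetical ratings matrix: rows = users, columns = items, 0 = unrated.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

target = 0                                            # recommend for user 0
sims = np.array([cosine_sim(ratings[target], r) for r in ratings])
sims[target] = 0.0                                    # ignore self-similarity

# Score items as a similarity-weighted average of the other users' ratings,
# then keep only the items the target user hasn't rated yet.
scores = sims @ ratings / sims.sum()
scores[ratings[target] > 0] = -np.inf
print(np.argmax(scores))                              # top recommended item
```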
5
Q

What is time series analysis?

A

A time series is a series of data points collected through repeated measurements at fixed regular intervals in time, for example daily sales data, or weekly market trends. The data points are indexed in time order, and can be plotted on a graph with time as one of the axes (usually the x-axis). This can be used to track change and identify trends and patterns over time. Time series analysis allows you to observe long-term trends, seasonal changes, and irregular, short-term fluctuations. By understanding the factors behind trends and patterns in time series data, you can use forecasting to anticipate future changes.
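A brief pandas sketch using made-up daily sales data; rolling averages smooth short-term fluctuations so the longer-term trend is easier to see:

```python
import pandas as pd

# Hypothetical daily sales, indexed in time order at fixed intervals.
sales = pd.Series(
    [200, 220, 210, 250, 400, 380, 205, 230, 215, 260, 410, 390, 210, 235],
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

trend = sales.rolling(window=7).mean()   # smooth out short-term fluctuations
weekly = sales.resample("W").sum()       # aggregate to a coarser interval
change = sales.pct_change()              # period-on-period change

print(trend.dropna())
```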

6
Q

Explain the core steps of a Data Analysis project.

A
  1. Identify the problem or question your project is aiming to address. Define the scope of your work and the overall objective of your project.
  2. Source relevant datasets, understand what format they are in and how they can be accessed. Finding ways to merge and combine datasets from different sources will allow you to enrich your data and reach more interesting conclusions. Ways to source data include:
    - Connecting to a database
    - Using an API to extract data (see the sketch after this list)
    - Finding open datasets online
  3. Extract and clean the data - eg removing duplicate data, fixing formatting or spelling errors, ensuring it is stored in the correct datatype. Identify and deal with any missing data and outliers. Ensure the data is being stored in a way that complies with data protection regulations. Organise the data in a structure that allows it to be analysed.
  4. Perform exploratory data analysis to understand the key characteristics of your data, highlighting interesting trends and anomalies in the data which need to be resolved. You can use different statistical modelling methods including linear regression and clustering, which can highlight interesting patterns in the data and predict future trends.
  5. Evaluate your model to ensure it is working as expected and producing the necessary information. Identify whether any further data is needed, or if your dataset needs further cleaning, refining, or organising, to achieve the intended objective.
  6. Deploy your model - create visualisations to illustrate your conclusions. You may use tools such as Tableau or Power BI to communicate your findings to your clients or stakeholders.
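
As an illustration of step 2, a hedged sketch of extracting data via an API; the URL, parameters, and response shape are placeholders, not a real service:

```python
import pandas as pd
import requests

# Placeholder endpoint and parameters - swap in a real API.
response = requests.get("https://api.example.com/v1/sales", params={"year": 2024})
response.raise_for_status()
records = response.json()        # assumes the API returns a JSON list of records

df = pd.DataFrame(records)       # load into a DataFrame ready for step 3
df = df.drop_duplicates()        # begin cleaning, as described above
```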
7
Q

What are the characteristics of a good data model?

A

Data modelling is a process for defining and ordering data for use and analysis by certain business processes.

  • The data should be easy for people outside the data team to consume. They should be able to understand and generate models themselves.
  • It should be easy to use and maintain.
  • The data should be presented in a simple and accessible way with clear descriptions, and it should be easy to identify whether a data source is accurate and up to date.
  • It should be reliable
  • It should be scalable and adaptable as the amount of data input grows or shrinks
  • It should be flexible enough to adapt to changing organisational requirements and priorities, without having to create a new model
8
Q

What are univariate, bivariate, and multivariate analysis?

A

Univariate analysis - analysis that looks at data with only one type of variable. Typically this would involve describing the data and finding patterns within it. A common method is looking at values such as the mean, median, and mode of the variable, the maximum and minimum values, and standard deviation. Common ways to visualise univariate analysis are histograms, which illustrate frequency distribution, and box plots, which show the spread of variables and highlight outliers.

Bivariate analysis - analysis where you are comparing two variables (an X value and a Y value) and exploring their relationship. Common ways to visualise bivariate analysis are a scatter plot, where the variables are plotted on a horizontal and vertical axis, or a regression plot, which helps visualise relationships between values by adding a regression line. You can also examine the correlation coefficient, which indicates the strength of the linear relationship between two variables.

Multivariate analysis - analysis where you are comparing more than two variables. One method is to create a 3D model to examine the relationship between three variables. Other methods include principal component analysis, logistic regression, linear regression, and cluster analysis.
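
A compact pandas sketch of all three levels, assuming a hypothetical dataset with numeric columns such as height and weight (filename and column names are placeholders):

```python
import pandas as pd

df = pd.read_csv("measurements.csv")      # placeholder filename and columns

# Univariate: summary statistics and a histogram of one variable.
print(df["height"].describe())            # mean, quartiles, min, max, std
df["height"].plot.hist()                  # frequency distribution

# Bivariate: scatter plot and correlation coefficient for two variables.
df.plot.scatter(x="height", y="weight")
print(df["height"].corr(df["weight"]))    # strength of the linear relationship

# Multivariate: pairwise correlations across all numeric variables.
print(df.corr(numeric_only=True))
```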

9
Q

What is a Linear Regression?

A

Linear regression is a common type of predictive analysis. The aim is to understand whether a set of variables can predict the outcome of a dependent variable, identify which variable is the most significant predictor of the outcome variable, and understand in what way they impact the variable. The simplest form of a regression is defined by the formula y = c + b*x, where y = estimated dependent variable score, c = constant, b = regression coefficient, and x = score on the independent variable.

A correlation coefficient of -1 or 1 indicates a perfect linear relationship, and a coefficient of 0 indicates no linear relationship. Positive coefficients indicate that as the value of one variable increases, the other tends to increase. Negative coefficients indicate that as the value of one variable increases, the other tends to decrease.

The main uses for regression analysis are: determining the strength of predictors, forecasting an effect, and forecasting trends.
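
A minimal sketch of fitting y = c + b*x with NumPy, using made-up numbers (for example, whether advertising spend predicts sales):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable (made up)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])   # dependent variable (made up)

b, c = np.polyfit(x, y, deg=1)   # least-squares fit; returns slope then intercept
print(f"y = {c:.2f} + {b:.2f}*x")

r = np.corrcoef(x, y)[0, 1]      # correlation coefficient, between -1 and 1
print(f"correlation: {r:.3f}")

forecast = c + b * 6.0           # forecast the outcome for an unseen x value
```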

10
Q

In terms of modelling data, what do we mean by Over-fitting and Under-fitting?

A

When developing a machine learning model, we aim for the model to learn general concepts from specific examples which are based on the problem the model is intended to solve. This is called inductive learning. A successful model can generalise effectively from the training data to apply the concepts it has learned to examples that it hasn’t seen before, and therefore make predictions. Overfitting and underfitting are terms that refer to how well a machine learning model can generalise to new data; they are both causes of poor performance and inaccuracy.

Overfitting is when a model learns the training data too closely, including noise or random fluctuations in the dataset; because this random noise won’t always apply to new data sets it can therefore affect the model’s ability to generalise. The model will perform well on the training data but will not perform well with new examples.

Conversely, underfitting is when a model has not sufficiently learned the training data and also cannot generalise to new data. If a model cannot understand the relationship between the input examples and the target values, it will not perform well on the training data and will not be able to make predictions.
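
One way to see both effects is to fit polynomials of increasing degree to noisy data and compare training error with error on unseen points; this sketch uses made-up data and NumPy's polyfit:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)   # signal plus noise
x_test = np.linspace(0.02, 0.98, 20)                             # unseen points
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 20)

for degree in (1, 3, 9):   # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```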

11
Q

What is exploratory analysis?

A

Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
- Always try to understand the data you work with first
- Gather as many insights from a dataset as possible
- Make sense of the data in hand before starting any processing or manipulation
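
An initial EDA pass in pandas might look like this (the filename and column name are placeholders):

```python
import pandas as pd

df = pd.read_csv("survey.csv")           # placeholder filename

print(df.head())                         # make sense of the data in hand
print(df.describe(include="all"))        # summary statistics
print(df["category"].value_counts())     # spot unexpected or rare values
print(df.corr(numeric_only=True))        # check assumed relationships
df.hist()                                # graphical view of each distribution
```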
