Danny's Zenne Stof Flashcards
What is data analytics in exploratory analytics?
It’s about the extraction of useful information and knowledge from large volumes of data, in order to improve decision making.
Why do we do data exploration?
We explore our data in order to understand it better.
How do you get started on data exploration?
- Import the data in the right format.
- Understand the meaning of the variables.
- Understand their typical values.
- Understand how values interact with each other.
- Understand how to combine different datasets.
- Understand the data types.
- Are there missing values?
- Are there outliers?
- What is the overall quality of our data?
Exploring data is twofold. Explain.
Descriptive statistics: distributions, relationships, …
Visualizations: scatterplots, histograms, barplots, …
Data visualization is arguably the most important. The information conveyed via visuals can be very quickly absorbed by the human brain.
You cannot prepare the data without understanding the data. Explain?
You need to know the quality of your data to know what to do in your preparation steps. Are there outliers, missing values, etc.?
A good start is half the battle… why?
You can’t begin working on your project unless you know and understand your data.
Datasets typically consist of rows and columns. What do they mean?
Rows are the observations/data points/entities.
Columns are the attributes/features/variables of your observations.
Which kinds of data sources are there?
Internal and External.
Give some examples of internal data sources.
- Company Website
- Customer Information: make sure to contact the person responsible for privacy before working with Personally Identifiable Information! (GDPR)
- Operations/Logistics data
- Financial data
Give some examples of external data sources
- APIs (e.g. tweets)
- Public Records (open data, available to anyone, e.g. government)
- Manually Labelled (e.g. reCaptcha, labeled customer reviews, …)
What kinds of data storage exists?
- Servers on Premise (small- to medium-sized datasets)
- Cloud (any kind of dataset)
e.g. Amazon Web Services (AWS), Google Cloud, Microsoft Azure, …
What kinds of data do you have? Give some examples.
Structured:
- Tabular data
- Customer information
- Transactional data
Unstructured:
- Text
- Video
- Audio
- Web Pages
- Social Media
What kind of databases are used for structured data?
Relational databases.
What kind of databases are used for unstructured data?
Document databases.
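What query language is used to access document databases?
There is no single standard one. Document databases belong to the NoSQL family of databases, and each product has its own query language or API.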
What query language is used to access relational databases?
SQL
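As a rough illustration of querying a relational database with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The `customers` table, its columns, and its rows are made up for the example.

```python
import sqlite3

# In-memory database; the `customers` table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ann", "Ghent"), ("Bob", "Brussels"), ("Cleo", "Ghent")],
)

# SQL query: count the customers per city.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Brussels', 1), ('Ghent', 2)]
conn.close()
```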
How can we turn unstructured data into structured data?
By means of feature extraction.
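A minimal sketch of feature extraction on text, assuming a few hand-picked features. The reviews, the function name `extract_features`, and the feature names are all hypothetical; real projects usually pick features suited to the task.

```python
import re
from collections import Counter

# Hypothetical unstructured input: free-text customer reviews.
reviews = [
    "Great product, great price!",
    "Terrible service. Would not buy again.",
]

def extract_features(text):
    """Turn one free-text review into a fixed set of numeric features."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        "n_words": len(words),
        "n_exclamations": text.count("!"),
        "count_great": counts["great"],
    }

# Each review becomes one row with the same columns: structured data.
table = [extract_features(review) for review in reviews]
for row in table:
    print(row)
```

Every review now has the same three columns, so the result can be stored and analysed like any other tabular dataset.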
What is character encoding and why is it important?
Character encoding tells the software how to interpret the bytes of your data. This is important so that your data is interpreted correctly.
Common encodings include UTF-8 and Latin1. Latin1 cannot represent Kanji, for example.
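The effect of choosing the wrong encoding can be sketched in a few lines of Python. The sample string is arbitrary; the point is that the same bytes read differently depending on the encoding used to decode them.

```python
text = "café 漢字"

# Encode the text to bytes using UTF-8.
utf8_bytes = text.encode("utf-8")

# Decoding with the matching encoding restores the original text.
right = utf8_bytes.decode("utf-8")

# Decoding with Latin-1 raises no error, but silently yields garbled text.
wrong = utf8_bytes.decode("latin-1")

print(right == text)  # True
print(wrong == text)  # False

# Latin-1 has no characters for Kanji, so encoding them fails outright.
try:
    "漢字".encode("latin-1")
    kanji_ok = True
except UnicodeEncodeError:
    kanji_ok = False
print(kanji_ok)  # False
```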
What are missing values?
Missing values are values that were not recorded for an observation. They often show up as empty cells or as markers such as NA, NaN or NULL.
What are some important steps to consider when importing data?
- Are we using the correct character encoding?
- Are there any missing values?
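Both checks can be sketched with Python's csv module. The CSV content below is invented, and the set of markers treated as missing is an assumption, since every dataset has its own conventions.

```python
import csv
import io

# Hypothetical CSV content; with a real file you would use
# open(path, encoding="utf-8", newline="") and state the encoding explicitly.
raw = "id,age,city\n1,34,Ghent\n2,,Brussels\n3,29,\n4,NA,Antwerp\n"

# Markers treated as missing here are an assumption; datasets differ.
MISSING = {"", "NA", "NaN", "NULL"}

reader = csv.DictReader(io.StringIO(raw))
missing_per_column = {name: 0 for name in reader.fieldnames}
for row in reader:
    for name, value in row.items():
        # A value is None when a row has fewer fields than the header.
        if value is None or value.strip() in MISSING:
            missing_per_column[name] += 1

print(missing_per_column)  # {'id': 0, 'age': 2, 'city': 1}
```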
What types of data are there?
Categorical:
- Nominal (unranked)
- Ordinal (ranked)
Numerical:
- Discrete (counted, not measured)
- Continuous (measured, not counted)
What is nominal data and give some examples.
Categorical data that does not indicate an order between the values.
- Male/Female
- Colours (red, green, blue)
What is ordinal data and give some examples.
Categorical data that does have some kind of order.
- Small, Medium, Large
- First Class, Second Class, Third Class
- Temperature labeled as “cold, mild, hot”
What is continuous data and give some examples.
Continuous data is data that can be measured, but not counted.
- Length
- Weight
What is discrete data and give some examples.
Discrete data can be counted, but not measured.
- Number of students
- Number of pens in the box.
- Number of chickens that walked out of the chicken coop.
What kind of statistics can you do with nominal data?
You can count the frequencies.
You can compute the proportions.
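Frequencies and proportions for a nominal variable, sketched with the standard library. The list of colours is made-up sample data.

```python
from collections import Counter

# Hypothetical nominal variable.
colours = ["red", "green", "red", "blue", "red", "green"]

# Frequency: how often each category occurs.
frequencies = Counter(colours)

# Proportion: the frequency divided by the number of observations.
total = len(colours)
proportions = {value: count / total for value, count in frequencies.items()}

print(frequencies["red"])   # 3
print(proportions["red"])   # 0.5
```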
How can you visualize nominal data?
Bar charts and pie charts.
What kind of statistics can you do with ordinal data?
Frequencies, proportions.
Percentiles and median.
What kind of statistics can you do with continuous and discrete data?
You can summarize your data using percentiles, median, mean, standard deviation, range …
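A small sketch of such a summary using Python's statistics module. The income values are invented, and `statistics.stdev` computes the sample standard deviation, which is only one of the two common variants.

```python
import statistics

# Hypothetical numeric variable (annual income in thousands).
incomes = [21, 25, 28, 30, 32, 35, 40, 95]

summary = {
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "stdev": statistics.stdev(incomes),  # sample standard deviation
    "range": max(incomes) - min(incomes),
    "quartiles": statistics.quantiles(incomes, n=4),  # three cut points
}

for name, value in summary.items():
    print(name, value)
```

Here the mean (38.25) lies well above the median (31.0) because the single value 95 pulls it upwards.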
How can you visualize numeric data?
Histograms.
Boxplots.
Which type of plot can show outliers? Histograms or Boxplots?
Boxplots. Histograms only show tendencies of your data, not individual outliers.
What do you call a variable that identifies a sample?
An object identifier.
Give some examples of object identifiers.
Row indexes, names, database ids.
What kind of information does a histogram give you?
The general tendencies of your data.
What are descriptive statistics?
Descriptive statistics give you insights by summarizing the data.
Give some examples of descriptive statistics.
- Average of the annual income.
- Median home prices in the neighbourhood.
- Range of credit scores of a population.
What is univariate exploration?
This is the analysis of one attribute at a time.
What is the mean?
This is the average of all observations in a dataset for a certain variable.
What is the median?
This is the middle value of a variable when its observations are sorted: half of the observations lie below it and half above it.
What is variability?
Variability describes how far the values of a variable are spread out. For instance, two variables with similar means and medians can have vastly different variabilities if their minimums and maximums are different.
What is range?
Range is the difference between the minimum and maximum value.
The range is very susceptible to the presence of outliers and fails to consider the distribution of all data points in the attribute.
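How strongly one outlier affects the range can be sketched as follows. The numbers and the helper name `value_range` are hypothetical.

```python
def value_range(data):
    """Difference between the maximum and the minimum value."""
    return max(data) - min(data)

values = [48, 50, 51, 52, 49]
with_outlier = values + [500]  # one extreme observation added

print(value_range(values))        # 4
print(value_range(with_outlier))  # 452
```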
What is spread?
Spread is quantified by the deviation and variance.
What is deviation?
The difference between an observation and the mean of its variable.
What is variance?
Variance is the average of the squared deviations of a variable from its mean.
What is standard deviation?
The square root of the variance.
What does it mean when an attribute has a high standard deviation?
The datapoints are spread widely around the central point.
What does it mean when an attribute has a low standard deviation?
It means that the datapoints are spread closely around the central point.
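The chain from deviation to variance to standard deviation can be worked through step by step. This sketch assumes the population variance (dividing by n); the sample variance divides by n - 1 instead. The values are arbitrary.

```python
import math

values = [2, 4, 4, 4, 5, 5, 7, 9]

# Mean of the variable.
mean = sum(values) / len(values)

# Deviation: each observation minus the mean.
deviations = [x - mean for x in values]

# Variance: average of the squared deviations (population version).
variance = sum(d ** 2 for d in deviations) / len(values)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 5.0 4.0 2.0
```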
What is multivariate exploration?
It means that we study more than one attribute simultaneously.
What is correlation?
Correlation measures the statistical relationship between two attributes.
What is spurious correlation?
A correlation that happens by accident, or because of an (unseen) third factor.
It’s a correlation that’s not causal.
What is the Pearson correlation coefficient?
A value (r) between -1 and 1. It describes how strongly two variables are linearly correlated.
-1 : perfect negative linear correlation
1 : perfect positive linear correlation
0 : no linear correlation at all
Pearson’s correlation coefficient is sensitive to outliers. Correct?
Yes.
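A minimal sketch of Pearson's r and its sensitivity to outliers, written out by hand so each step is visible. The function name `pearson_r` and the data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # perfectly linear in x

r_clean = pearson_r(x, y)

# One extreme point is enough to flip the sign of r.
r_outlier = pearson_r(x + [6], y + [-30])

print(r_clean)    # 1.0
print(r_outlier)  # -0.5
```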
What do we use scatterplots for?
We use scatterplots to compare 2 numerical attributes. We can compare more attributes by using colours, shapes, etc. to plot a third attribute.
What is a histogram? What do you use it for?
A histogram can be used to visualize the distribution of data by plotting the frequency of occurrence in a range.
What’s the optimal number of bins or binwidth in a histogram?
There is no single optimal number; it depends on the data.
How can we compare histograms across the values of a categorical variable?
By using colours: draw one histogram per category, each in its own colour. This is useful to see how the distribution of a numerical attribute differs between the categories.
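Why the number of bins matters can be sketched by counting the same values with two different bin settings. The helper `histogram_counts` is hypothetical, uses equal-width bins, and assumes the values are not all identical.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical ages.
ages = [18, 19, 21, 22, 22, 23, 25, 31, 34, 35, 36, 52]

for n_bins in (2, 4):
    print(n_bins, histogram_counts(ages, n_bins))
# 2 [9, 3]
# 4 [7, 2, 2, 1]
```

With two bins the data looks like one large group and a small one; four bins reveal the long tail towards the highest age.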
What is a boxplot?
A boxplot is a simple but powerful visual way of showing the distribution of a numerical variable. A boxplot shows useful information like outliers and the interquartile range.
What makes boxplots interesting?
You can compare them easily.
What is Q1, Q2 and Q3 in a boxplot?
Q1 (25th percentile) and Q3 (75th percentile) indicate the edges of the box. Q2 indicates the median of the distribution.
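The quartiles behind a boxplot can be computed with statistics.quantiles. The 1.5 × IQR rule used below to flag outliers is a common plotting convention, not something stated in these notes, and the values are invented.

```python
import statistics

values = [12, 14, 15, 15, 16, 17, 18, 19, 45]

# Three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed convention: points further than 1.5 * IQR from the box edges
# are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, q2, q3)  # 15.0 16.0 18.0
print(outliers)    # [45]
```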
What is R²?
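The coefficient of determination, a measure of model fit: the proportion of the variance in the target variable that the model explains. A higher number indicates a better model fit.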
Where are the data samples located on a linear regression between two highly correlated numerical variables?
Very close to the linear regression line.
Where are the data samples located on a linear regression between two weakly correlated numerical variables?
Very scattered and not along the linear regression line.
Do outliers strongly influence the linear regression calculation?
Yes.
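The link between how close the samples lie to the regression line and R² can be sketched with a hand-written least-squares fit. The function names `fit_line` and `r_squared` and both datasets are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """Share of the variance in ys explained by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5]
y_tight = [2.1, 3.9, 6.2, 7.8, 10.1]  # samples close to a straight line
y_loose = [5, 1, 9, 2, 8]             # samples scattered widely

r2_tight = r_squared(x, y_tight)
r2_loose = r_squared(x, y_loose)
print(round(r2_tight, 3), round(r2_loose, 3))
```

The tightly clustered samples give an R² close to 1, while the scattered samples give a value close to 0.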
What is a scatter matrix?
It’s tedious to draw a separate scatterplot for every pair of numerical attributes in datasets with many numerical features.
You can use a scatter matrix to quickly show comparisons for all of them.
A scatter matrix shows a scatterplot for each pair of attributes below the main diagonal.
The main diagonal shows a histogram of the attribute it represents.
Above the main diagonal is the r-value, which shows how strongly each pair of attributes is correlated.