G5. Descriptive Statistics Flashcards
What is the role of descriptive statistics with regard to the analysis of data collections?
To visualize and summarize the sample distribution, giving a concise and meaningful summary of the data's key characteristics. This in turn allows us to make tentative assumptions about the population distribution.
What type of questions can be answered using descriptive statistics? Which are the mathematical tools used for that?
Factual queries about what the data contains, answered by summarizing and presenting its main features. The tools are measures of central tendency (mean, median, mode), variability or dispersion (range, variance, standard deviation), distribution shape (skewness, kurtosis), frequencies and proportions, percentiles and quartiles, and correlation between variables.
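As an illustration of the measures listed above, here is a minimal sketch using pandas. The `scores` series is made-up sample data, not taken from the course material.

```python
import pandas as pd

# Hypothetical sample: exam scores for eight students.
scores = pd.Series([55, 60, 60, 65, 70, 75, 80, 95], name="score")

# Central tendency
mean = scores.mean()
median = scores.median()
mode = scores.mode().tolist()  # mode() returns a Series because ties are possible

# Dispersion
std = scores.std()  # sample standard deviation (ddof=1 by default)
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1  # interquartile range

# Distribution shape: a positive value indicates a longer right tail
skew = scores.skew()

print(f"mean={mean}, median={median}, mode={mode}")
print(f"std={std:.2f}, IQR={iqr}, skew={skew:.2f}")
```

Here the mean lies above the median, which is already a hint that the high score of 95 pulls the distribution to the right.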
Which methods are provided by Python Pandas for getting acquainted with data collections content in a quantitative manner? What about R?
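Python:
head() and tail(): first and last rows
info(): column names, non-null counts, dtypes and memory usage
describe(): summary statistics for the numeric columns
shape: (rows, columns) tuple; an attribute, so no parentheses
dtypes: data type of each column; also an attribute
value_counts(): frequency of each distinct value
corr(): pairwise correlation of the numeric columns
isnull() with sum(): number of missing values per column; a heatmap of the missing-value mask needs a plotting library such as seaborn, since heatmap() is not a pandas method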
R:
head() and tail()
str()
summary()
dim()
class()
table()
cor()
is.na(), sum(), and heatmap()
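The pandas calls from the Python list can be tried out on a small table. This is only a sketch: the DataFrame and its column names (`city`, `temp_c`, `humidity`) are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical weather readings with one missing temperature.
df = pd.DataFrame({
    "city": ["Lyon", "Paris", "Lyon", "Nice", "Paris"],
    "temp_c": [21.5, 19.0, np.nan, 25.0, 18.5],
    "humidity": [40, 55, 48, 35, 60],
})

print(df.head(3))     # first three rows
print(df.tail(2))     # last two rows
df.info()             # prints column names, non-null counts and dtypes
print(df.describe())  # count, mean, std, min, quartiles, max (numeric columns)
print(df.shape)       # attribute: (rows, columns)
print(df.dtypes)      # attribute: dtype of each column

city_counts = df["city"].value_counts()   # frequency of each city
corr = df[["temp_c", "humidity"]].corr()  # pairwise Pearson correlation
missing = df.isnull().sum()               # missing values per column

print(city_counts, corr, missing, sep="\n\n")
```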
How is shape used for analysing data in a DataFrame? Is there an equivalent in R?
Purpose: Returns the number of rows and columns in a DataFrame as a tuple (rows, columns).
Usage: df.shape (an attribute, not a method, so it is written without parentheses)
R: dim(df)
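A short sketch of how the tuple is typically consumed, assuming a small made-up DataFrame with columns `a`, `b` and `c`:

```python
import pandas as pd

# Hypothetical table: two numeric columns and one text column.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0.5, 1.5, 2.5, 3.5],
    "c": list("wxyz"),
})

# shape is a plain tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape

# size is the total number of cells (rows * columns).
n_cells = df.size

# Checking the shape after a selection shows how much data is left.
numeric_shape = df.select_dtypes(include="number").shape

print(n_rows, n_cols, n_cells, numeric_shape)
```

Comparing the shape before and after a filtering or cleaning step is a quick way to see how many rows or columns the step removed.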
What issues have to be considered in order to be able to apply statistics to raw data collections?
Data Quality
Data Scale and Units
Sampling Bias
Data Transformation
Statistical Assumptions
What is the role of the generation of graphics in the application of descriptive statistics for analysing data?
Data Exploration
Pattern Recognition
Outlier Detection
Correlation and Relationships
Distribution Analysis
Communicating Results
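Two of the roles above, distribution analysis and outlier detection, can be sketched with a histogram and a box plot. The example assumes matplotlib is installed (it is not part of pandas), and the `delivery_days` values are invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical delivery times in days, with one unusually long delivery.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 18, 19, 45], name="delivery_days")

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: shows where the values concentrate and how long the tail is.
ax_hist.hist(values.to_numpy(), bins=8)
ax_hist.set_title("Histogram")

# Box plot: points beyond 1.5 * IQR from the quartiles are drawn as fliers.
box = ax_box.boxplot(values.to_numpy())
ax_box.set_title("Box plot")

fig.tight_layout()
outliers = list(box["fliers"][0].get_ydata())
print(outliers)
```

Calling `fig.savefig("some_name.png")` afterwards would write the figure to disk; the file name is a placeholder.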
Which are the strategies used for dealing with dirty data when applying descriptive statistics functions?
Data Cleaning (Handling Missing Values, Correcting Errors)
Outlier Detection and Handling (visual or statistical)
Data Transformation (Logarithmic, normalisation)
Handling duplicates
Formatting (consistent types, units and text case)
Categorical data (dummy encoding)
Data imputation
Cross validation
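Several of the strategies above can be chained in pandas. The following is a minimal sketch under stated assumptions: the `raw` table is invented, median imputation and the 1.5 * IQR rule are just one common choice each, and cross validation is left out because it belongs to model evaluation rather than to cleaning.

```python
import numpy as np
import pandas as pd

# Hypothetical dirty data: stray whitespace, mixed case, a duplicate row,
# a missing income and one extreme income.
raw = pd.DataFrame({
    "name": [" Ana", "Ben", "Ben", "Cleo", "Dan", "Eve"],
    "group": ["a", "B", "B", "a", "b", "A"],
    "income": [36000.0, 41000.0, 41000.0, np.nan, 38000.0, 990000.0],
})

clean = raw.copy()

# Formatting: trim whitespace and normalise text case.
clean["name"] = clean["name"].str.strip()
clean["group"] = clean["group"].str.upper()

# Duplicates: drop rows that are identical in every column.
clean = clean.drop_duplicates().reset_index(drop=True)

# Imputation: fill missing incomes with the median, which is robust to outliers.
clean["income"] = clean["income"].fillna(clean["income"].median())

# Outlier detection: flag values outside 1.5 * IQR from the quartiles.
q1, q3 = clean["income"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["is_outlier"] = (clean["income"] < q1 - 1.5 * iqr) | (clean["income"] > q3 + 1.5 * iqr)

# Transformation: a logarithm compresses the long right tail.
clean["log_income"] = np.log(clean["income"])

# Categorical data: one dummy (indicator) column per category.
dummies = pd.get_dummies(clean["group"], prefix="group")

print(clean)
print(dummies)
```

Flagging outliers instead of deleting them keeps the decision reversible; whether to drop, cap or keep them depends on the analysis.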
Why is it important to know the distribution of the values of a given attribute in a data analytics process?
It provides a basis for descriptive statistics, aids in data exploration, and guides subsequent analytical steps, such as choosing between the mean and the median or deciding whether to transform a skewed attribute. A thorough understanding of the distribution makes the insights gained from the data analytics process more accurate and reliable.
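A small sketch of why the shape matters, using invented values with one extreme observation: the mean and the median disagree strongly, and a log transform reduces the skew.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed attribute: most values are small, one is very large.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])

mean = values.mean()
median = values.median()
raw_skew = values.skew()

# A log transform is one common way to reduce right skew (values must be > 0).
log_skew = np.log(values).skew()

print(f"mean={mean}, median={median}")
print(f"skew before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

With a distribution like this, the median describes a typical value better than the mean, which is dominated by the single large observation.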