Exploratory Data Analysis and Data Visualization Flashcards
What are data sets made up of?
Data objects (also called samples, examples, instances, data points, objects)
e.g. medical dataset: patients, treatments, medicine
Database rows - data objects
Database columns - attributes
What are attributes in a dataset?
A data field representing a characteristic or feature of a data object
e.g. customer_ID
Attributes are the columns of the dataset
What different attribute types do we know?
- Nominal: Categories, states or names of things
e.g. hair color = {black, blond, red, etc.}
- Binary: Nominal attributes with only 2 states (0 and 1)
e.g. gender, medical test
- Ordinal: Values have a meaningful order (ranking), but the magnitude between successive values is not known
e.g. size = {small, medium, large} or army ranks
What is the difference between discrete and continuous attributes?
- Discrete attributes
  - have only a finite or countably infinite set of values (e.g. zip codes)
  - binary attributes are a special case of discrete attributes
- Continuous attributes
  - have real numbers as attribute values
    e.g. height, temperature, weight
  - usually represented as floating-point variables
What is the mean?
The average value of a dataset
Calculated as the sum of all values divided by the number of values
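A minimal Python sketch of this calculation. The list `values` is made-up illustration data, not part of the flashcards.

```python
# Illustrative values (an assumption, not from the flashcards)
values = [4, 8, 15, 16, 23, 42]

# Mean: sum of all values divided by the number of values
mean = sum(values) / len(values)
print(mean)  # 18.0
```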
What is the median?
Odd number of values: the middle value of the sorted data
Even number of values: the average of the two middle values
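The odd and even cases could be sketched in Python as follows. The helper name `median` and the sample lists are illustrative assumptions, and the sketch expects a non-empty list of numbers.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)  # the median is defined on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: take the middle value
        return ordered[mid]
    # Even number of values: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0
```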
What is the mode?
The value that occurs most frequently
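One possible way to find the mode in Python, sketched with `collections.Counter`. The helper name `modes` and the sample data are illustrative assumptions. It returns a list because several values can tie for the highest frequency.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most frequently (ties are kept)."""
    counts = Counter(values)          # value -> number of occurrences
    highest = max(counts.values())    # the largest frequency
    return [v for v, c in counts.items() if c == highest]

print(modes(["red", "blond", "black", "red"]))  # ['red']
print(modes([1, 1, 2, 2, 3]))                   # [1, 2]
```

The mode also works for nominal attributes such as hair color, where a mean or median is not defined.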
What is a Boxplot?
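- The ends of the box mark the first and third quartiles
- The median is marked by a line inside the box
- The whiskers extend to the smallest and largest values that are not outliers; outliers are plotted individually beyond the whiskers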
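The numbers behind a boxplot can be sketched with Python's standard library. The sample data and the 1.5 x IQR outlier rule are assumptions here: the rule is a common convention but the flashcards do not state it, and quartile values can differ slightly between quantile methods.

```python
import statistics

# Illustrative values (an assumption, not from the flashcards)
values = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Quartiles: q1 and q3 are the ends of the box, q2 is the median
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range, the length of the box

# Assumed convention: points further than 1.5 * IQR from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [v for v in values if lower_fence <= v <= upper_fence]
outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Whiskers reach the most extreme values that are not outliers
whisker_low, whisker_high = min(inside), max(inside)

print(q1, q2, q3)                 # 4.0 6.0 8.0
print(whisker_low, whisker_high)  # 2 9
print(outliers)                   # [30]
```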
What does a Histogram display?
Graphical display of tabulated frequencies, shown as bars
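A rough sketch of the tabulation step behind a histogram, printed as text bars. The sample values and bin edges are illustrative assumptions. Bins are treated as half-open intervals here, so a value equal to the last edge would not be counted.

```python
# Illustrative values and bin edges (assumptions, not from the flashcards)
values = [1.2, 1.9, 2.5, 3.1, 3.3, 3.8, 4.4, 4.9]
bin_edges = [1, 2, 3, 4, 5]

# Tabulate how many values fall into each bin [left, right)
counts = [0] * (len(bin_edges) - 1)
for v in values:
    for i in range(len(counts)):
        if bin_edges[i] <= v < bin_edges[i + 1]:
            counts[i] += 1
            break

# Draw one text bar per bin; the bar length is the frequency
for i, c in enumerate(counts):
    print(f"[{bin_edges[i]}, {bin_edges[i + 1]}): {'#' * c}")

print(counts)  # [2, 1, 3, 2]
```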
What is the difference between bar chart and histogram?
In a histogram, the area of the bar denotes the value, not the height as in a bar chart. This matters when the bins are not of uniform width.
What is a scatter plot good for?
Provides a look at clusters of points and outliers
What interface do you use in Python to visualize data?
The pyplot interface of the Matplotlib library (matplotlib.pyplot)
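A minimal sketch of how the three plot types from these cards might be drawn with pyplot. The data, the output file name `eda_overview.png`, and the choice of the non-interactive `Agg` backend are all assumptions for illustration. In an interactive session one would typically call `plt.show()` instead of saving a file.

```python
import matplotlib
matplotlib.use("Agg")  # assumed: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Illustrative data (an assumption, not from the flashcards)
heights = [158, 162, 165, 167, 170, 172, 175, 178, 181, 199]
weights = [52, 58, 61, 63, 68, 70, 74, 79, 83, 95]

# One figure with three side-by-side axes
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].boxplot(heights)           # box, median line, whiskers, outlier points
axes[0].set_title("Boxplot")

axes[1].hist(heights, bins=5)      # frequencies per bin, shown as bars
axes[1].set_title("Histogram")

axes[2].scatter(heights, weights)  # one point per (height, weight) pair
axes[2].set_title("Scatter plot")

fig.tight_layout()
fig.savefig("eda_overview.png")    # hypothetical output file name
```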