1.3 Understanding your data set Flashcards
Observations
are instances of some group of interest and are generally represented as the rows of a worksheet. If you had data on individual students, for example, each student would count as a single observation. If you had data on different songs, each song would count as a single observation.
Characteristics
are what each column within a data set represents. In Figure 2, for instance, each observation, or student, has only a single characteristic: “Classroom.” In the second image, another characteristic is added: “Semester.” While both of these examples are quite simple, in principle, you could have any number of characteristics.
Discrete
data is data that can only be represented through whole numbers (e.g., the number of students in a class or the number of animals in a zoo). You couldn’t have half of a student, or, say, .378 of a leopard (unless you’re looking at data from some kind of horror movie!).
Continuous
data, on the other hand, is measured along a scale and can take any point along that scale as its value (e.g., temperature readings such as 94.5 or 93.5).
Nominal
the categories have no order. If you were categorizing cars, for example, you could have categories for each manufacturer (e.g., Honda, Ford, Toyota, etc.). As none of these categories are more or less than the other categories, there’s no implicit order to how you might organize them.
Honda, Ford, Toyota
Ordinal
data in which there exists an implicit order to the way it’s organized.
Small, Medium, Large
Binary
categorizes data into two groups
Good/Bad, Yes/No
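One way to see the difference between these categorical types is a minimal Python sketch. The helper names (`sort_ordinal`, `is_binary`) and the rank table are illustrative assumptions, not part of the course material; the example values come from the cards above.

```python
# Assumed rank table: ordinal categories carry an explicit order.
SIZE_ORDER = {"Small": 0, "Medium": 1, "Large": 2}


def sort_ordinal(values, order=SIZE_ORDER):
    """Sort ordinal labels by their rank rather than alphabetically."""
    return sorted(values, key=lambda v: order[v])


def is_binary(values):
    """True when the values fall into exactly two groups."""
    return len(set(values)) == 2


sizes = ["Large", "Small", "Medium", "Small"]
print(sort_ordinal(sizes))  # ['Small', 'Small', 'Medium', 'Large']
print(sorted(sizes))        # alphabetical only: ['Large', 'Medium', 'Small', 'Small']

# Nominal categories have no rank, so any ordering is arbitrary.
print(is_binary(["yes", "no", "yes"]))          # True
print(is_binary(["Honda", "Ford", "Toyota"]))   # False
```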
Discrete
Number of Students
Continuous
Temperature
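A rough way to check which kind of numeric data a column holds, sketched in Python. The function name `is_discrete` and the sample values are assumptions for illustration. The check only looks at the values it is given, so a continuous variable that happens to be recorded in whole numbers would still pass.

```python
def is_discrete(values):
    """True when every value is a whole number (e.g., a count)."""
    return all(float(v).is_integer() for v in values)


student_counts = [28, 31, 25]      # number of students: discrete
temperatures = [94.5, 93.5, 95.0]  # temperature readings: continuous

print(is_discrete(student_counts))  # True
print(is_discrete(temperatures))    # False
```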
Data imputation
involves substituting an estimated value for a missing value. There are various approaches to making the estimation: averaging the non-missing values, taking the most common of the non-missing values, or even taking a random value from the non-missing data. At the end of the day, the analyst needs to decide carefully whether to remove rows with missing data or to impute values for the missing data.
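The three estimation approaches named above (average, most common value, random value) can be sketched in plain Python. This is a minimal illustration, assuming missing values are stored as `None`; the function name `impute` and its `strategy` parameter are hypothetical, not from any particular library.

```python
import random
from statistics import mean, mode


def impute(values, strategy="mean", rng=None):
    """Return a copy of values with each None replaced by an estimate."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no non-missing values to estimate from")

    if strategy == "mean":
        # Average of the non-missing values.
        estimate = mean(observed)
        return [estimate if v is None else v for v in values]
    if strategy == "mode":
        # Most common of the non-missing values.
        estimate = mode(observed)
        return [estimate if v is None else v for v in values]
    if strategy == "random":
        # A random non-missing value, drawn separately for each gap.
        rng = rng or random.Random()
        return [rng.choice(observed) if v is None else v for v in values]
    raise ValueError(f"unknown strategy: {strategy!r}")


scores = [60, None, 90, 90, None]  # made-up example data
print(impute(scores, "mean"))  # gaps filled with 80
print(impute(scores, "mode"))  # gaps filled with 90
```

Mean imputation only makes sense for numeric data; for categorical columns, the most-common-value or random strategies are the ones that apply.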
What to do if the missing data is random?
Remove the rows with missing data.
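Removing rows with missing data can be sketched as follows, assuming each row is a dict and a missing value is stored as `None`. The function name `drop_incomplete` and the student rows are made up for illustration.

```python
def drop_incomplete(rows):
    """Keep only the rows in which no value is missing (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]


students = [
    {"student": "A", "classroom": "101"},
    {"student": "B", "classroom": None},  # missing classroom
    {"student": "C", "classroom": "102"},
]

complete = drop_incomplete(students)
print(len(complete))  # 2
```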
If the missing data isn’t random?
If the missing data isn’t random, however, it’s usually better to impute. This will keep you from introducing bias into your data set.
Population
the domain of interest (e.g., women between 25 and 35 years of age)
Sample
a subset of that population
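The population/sample relationship can be illustrated with a simple random draw. This sketch assumes the population is available as a list of identifiers; the names `draw_sample` and `person_…` are hypothetical, and simple random sampling is only one of several ways a sample might be chosen.

```python
import random


def draw_sample(population, k, seed=None):
    """Return k distinct members of the population, chosen at random."""
    return random.Random(seed).sample(population, k)


# Hypothetical population of 100 people in the domain of interest.
population = [f"person_{i}" for i in range(1, 101)]
sample = draw_sample(population, k=10, seed=0)

# Every member of the sample also belongs to the population.
print(len(sample), set(sample) <= set(population))  # 10 True
```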