BUSI344 / CHAPTER 3 Flashcards by Dennis Ragauskas

___________________ is an approach to learning from data.

A by-product of both the computer revolution and the growth of the Internet has been an exponential growth in the amount and complexity of available data.

Exploratory Data Analysis (EDA) is an approach to learning from data.

A by-product of both the computer revolution and the growth of the Internet has been an exponential growth in the amount and complexity of available data.

Give examples of ratio variables

Variables like height, weight, enzyme activity are ratio variables.

Interval Variables

Interval variables have a relationship between them, e.g., House A built in 2011 is five years newer than House B built in 2006. However, the relative “distance” between the two does not have a direct mathematical relationship or meaning. For example, is House A 2011 ÷ 2006 = 1.00249 times better than House B based on year built? Probably not.

EDA is a way to help _ _ _ _ _ _

EDA is a way to help make sense of the vast data facing today’s business analyst.

The EDA process is about _________, __________, and _______ information.

The EDA process is about evaluating, synthesizing, and leveraging information.

A set of data involves a number of “variables” and “observations” or “cases”. The “cases” are the observations accumulated into the dataset, such as 550 property sales, 120 leases, or 14,562 automobile purchasers.

The “variables” are the characteristics of the cases, such as the number of bedrooms, square footage, base rents, or car colour preferences.

BINARY VARIABLES

Binary variables (or dummy variables) are a special case of a discrete variable used for non-numeric variables, such as location, building features, and views. A binary variable, as the name implies, has only two possible values; the classic example is on and off. These are most often used in data analysis to indicate the presence or absence of a particular characteristic.
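
As a rough illustration of the presence/absence coding described above, the sketch below turns two non-numeric fields into 0/1 dummy variables. The records, the field names, and the `add_dummy` helper are all hypothetical and exist only for this example.

```python
# Hypothetical property records; the field names are made up for illustration.
properties = [
    {"id": 1, "view": "ocean", "pool": "yes"},
    {"id": 2, "view": "none", "pool": "no"},
    {"id": 3, "view": "mountain", "pool": "yes"},
]


def add_dummy(records, field, present_value, dummy_name):
    """Return copies of the records with a 0/1 column marking present_value."""
    return [
        {**rec, dummy_name: 1 if rec[field] == present_value else 0}
        for rec in records
    ]


# 1 = characteristic present, 0 = characteristic absent.
coded = add_dummy(properties, "pool", "yes", "has_pool")
coded = add_dummy(coded, "view", "ocean", "ocean_view")

for rec in coded:
    print(rec["id"], rec["has_pool"], rec["ocean_view"])
```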

Explain ordinal data.

Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known.

Examples of ordinal data

A well-known example of ordinal data is the Likert scale.

Examples of ordinal data are often found in questionnaires: for example, the survey question “Is your general health poor, reasonable, good, or excellent?” may have those answers coded respectively as 1, 2, 3, and 4. Sometimes data on an interval scale or ratio scale are grouped onto an ordinal scale: for example, individuals whose income is known might be grouped into the income categories $0-$19,999, $20,000-$39,999, $40,000-$59,999, …, which then might be coded as 1, 2, 3, 4, …. Other examples of ordinal data include socioeconomic status, military ranks, and letter grades for coursework.
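
The two codings mentioned above (survey answers coded 1 to 4, and incomes grouped into $20,000 bands) might be sketched as follows. The names `HEALTH_CODES` and `income_band` are hypothetical, and the band width is taken from the income categories in the example.

```python
# Ordered survey answers mapped to the codes 1, 2, 3, 4.
HEALTH_CODES = {"poor": 1, "reasonable": 2, "good": 3, "excellent": 4}


def income_band(income, width=20_000):
    """Code an income into bands: $0-$19,999 -> 1, $20,000-$39,999 -> 2, ..."""
    if income < 0:
        raise ValueError("income must be non-negative")
    return int(income // width) + 1


responses = ["good", "poor", "excellent"]
coded_responses = [HEALTH_CODES[r] for r in responses]

incomes = [19_999, 20_000, 45_000]
coded_incomes = [income_band(x) for x in incomes]

print(coded_responses)
print(coded_incomes)
```

The codes preserve order only: a code of 4 ranks above a code of 2, but it is not "twice" anything.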

Explain what a nominal variable is.

A nominal variable is another name for a categorical variable.

Nominal variables have two or more categories without having any kind of natural order. They are variables with no numeric value, such as occupation or political party affiliation.

What is an interval variable?

An interval variable is a measurement where the difference between two values is meaningful. The difference between a temperature of 100 degrees and 90 degrees is the same difference as between 90 degrees and 80 degrees.

What is a ratio variable?

A ratio variable has all the properties of an interval variable, and also has a clear definition of 0.0. When the variable equals 0.0, there is none of that variable.

Variables like height, weight, enzyme activity are ratio variables.

Temperature, expressed in F or C, is not a ratio variable. A temperature of 0.0 on either of those scales does not mean ‘no heat’.

However, temperature in Kelvin is a ratio variable, as 0.0 Kelvin really does mean ‘no heat’.

Another counterexample is pH. It is not a ratio variable, as pH = 0 just means a 1 molar concentration of H+, and the definition of molar is fairly arbitrary. A pH of 0.0 does not mean ‘no acidity’ (quite the opposite!).

When working with ratio variables, but not interval variables, you can look at the ratio of two measurements. A weight of 4 grams is twice a weight of 2 grams, because weight is a ratio variable.

A temperature of 100 degrees C is not twice as hot as 50 degrees C, because temperature C is not a ratio variable. A pH of 3 is not twice as acidic as a pH of 6, because pH is not a ratio variable.

Ratio variables on the other hand can tell us a great deal. For example, a 5,000 square metre warehouse is twice as large as a 2,500 square metre one. A house with 50 metres of waterfront has 25% more than a house with 40 metres.
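
The arithmetic behind these comparisons can be checked with a short sketch. It assumes the standard conversion from Celsius to Kelvin (add 273.15) and the standard definition of pH as the negative base-10 logarithm of hydrogen ion concentration; the variable names are illustrative only.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to Kelvin, a scale with a true zero."""
    return c + 273.15


# Dividing Celsius readings gives 2.0, but the number has no physical meaning.
naive_ratio = 100 / 50

# On the Kelvin (ratio) scale, 100 C is only about 1.15 times 50 C.
kelvin_ratio = celsius_to_kelvin(100) / celsius_to_kelvin(50)

# pH is logarithmic: pH 3 has 10**(6 - 3) times the H+ concentration of pH 6.
h_concentration_ratio = 10 ** (6 - 3)

# Genuine ratio variables support direct comparison.
warehouse_ratio = 5000 / 2500          # twice as large
waterfront_increase = (50 - 40) / 40   # 25% more

print(naive_ratio, round(kelvin_ratio, 4), h_concentration_ratio)
print(warehouse_ratio, waterfront_increase)
```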

Define Ratio Variable

Variable ratio definition

Ratio variables are interval variables, but with the added condition that 0 (zero) of the measurement indicates that there is none of that variable. So, temperature measured in degrees Celsius or Fahrenheit is not a ratio variable because 0 °C does not mean there is no temperature.

BINARY VARIABLES
A SPECIAL CASE OF

Binary variables (or dummy variables) are a special case of a discrete variable used for non-numeric variables, such as location, building features, and views.

4 Rs OF EDA

The “four Rs” of EDA: reduction, revelation, re-expression, and residuals (from models).

REDUCTION

Reduction means simplifying the information, focusing it to a small enough “package” that it becomes comprehensible. As an analogy, consider how the term “reduction” is used in cooking: boiling down a soup until the excess liquid is evaporated, with the broth becoming increasingly concentrated. The same process is used in data analysis: “boil it down” until the essential elements become clear.
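
One way to picture "boiling down" a dataset is to reduce a list of values to a handful of descriptive statistics. The sketch below uses Python's standard `statistics` module on a made-up list of sale prices; the figures are hypothetical and include one deliberately extreme sale.

```python
import statistics

# Hypothetical sale prices in dollars; the 1,250,000 sale is an extreme value.
prices = [410_000, 385_000, 455_000, 399_000, 1_250_000, 420_000, 405_000]

# Seven observations reduced to a small, comprehensible "package".
summary = {
    "n": len(prices),
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "stdev": statistics.stdev(prices),
    "min": min(prices),
    "max": max(prices),
}

for name, value in summary.items():
    print(f"{name}: {value:,.0f}")
```

In this made-up sample the mean sits well above the median, which is itself a hint that one sale is pulling the average upward.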

TRUE OR FALSE?
Not all variation can be explained; some variation is random.

ANSWER: TRUE

NOTE ONLY

Real estate price data would typically be considered stochastic or non-deterministic. This means there is a random element in it which precludes fully explaining all variation with 100% accuracy. For example, you can use past sales transactions to partially forecast future prices, but never with complete accuracy. In contrast, deterministic events are those completely explained by existing or past causes and with no uncertainty whatsoever - e.g., when a tree releases an apple, gravity causes it to fall to the ground.

NOTE ONLY

Real estate price data would typically be considered _______ or _____________

Real estate price data would typically be considered stochastic or non-deterministic.

DATA DISTINCTIONS

Another related distinction in data is its smooth and rough components.

RE-EXPRESSION

Re-expressing the size-unit price variables by their natural logs linearizes the relationship.

SMOOTH PROPERTIES IN DATA

The smooth properties in data are those explained by measures of central tendency, dispersion, and distribution.

SMOOTH / ROUGH DATA ANALYSIS

Smooth elements are analyzed using the descriptive statistics discussed in the “Reduction” section; rough elements are analyzed using features that showcase outliers, such as boxplots, scatterplots, and histograms.

NOTE ONLY

The relationship in scatterplots can be made even clearer by fitting a line or a curve to the data. This is called smoothing the data, or in other words explaining the “smooth” element - again, the “rough” element being how the data points vary from this line.

NOTE ONLY

RE-EXPRESSION

Continuous data, including real estate data, often displays a non-linear function. In order to fit this to a linear model, the data may need to be transformed. When you view a scatterplot of the data, if the graph(s) suggest the link between X and Y is not linear, either X or Y or both can be transformed (re-expressed) by their logarithmic functions (e.g., natural log or log base 10), different powers, their square root, or their inverse. By transforming the variable, the resulting graph may show a linear relationship.
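
A minimal sketch of re-expression, under the assumption that unit price falls with size along a power curve: the data below is synthetic, the exponent and scale are invented, and `pearson_r` is a hand-written helper. Taking natural logs of both variables turns the curve into a straight line, which shows up as a correlation of essentially -1.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient, a measure of linear association."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic data: unit price declines with size along a power curve.
sizes = [50, 100, 200, 400, 800, 1600]
unit_prices = [9000 * s ** -0.3 for s in sizes]

# Correlation on the raw scale, then after re-expressing both by natural log.
r_raw = pearson_r(sizes, unit_prices)
r_log = pearson_r(
    [math.log(s) for s in sizes],
    [math.log(p) for p in unit_prices],
)

print(f"raw scale: r = {r_raw:.3f}")
print(f"log scale: r = {r_log:.3f}")
```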

**HETEROSKEDASTICITY**

**Heteroskedasticity is the problem of errors/residuals whose variance is not constant, for example widening from the beginning of the data to the end. If the residual graph shows heteroskedasticity, then you should change the way you measure your variable. Usually the problem is that the errors get larger over time, partly because the size of the variable gets larger over time - for example, real estate prices tend to increase over time, so a forecast of future real estate prices will have errors that become more and more magnified.**

**ROUGH DATA ELEMENTS**

**The rough elements are those outside these smooth considerations, such as the residual between a given data point and the line of best fit or, at the extreme, outliers that are outside the range of reasonableness.**

**OUTLIERS**

**Outliers can mask important trends by reducing the resolution of a plot. They can also significantly influence the results of an analysis. We must consider why any outlier value is extraordinary before deciding to keep it or discard it. In other words, we must weigh the benefits against the costs: on the one hand, by removing an outlier we may be able to better identify the relationships in the data, but we must be careful not to "over-manage" our data.**
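
The card does not name a rule for spotting outliers. As one common convention (an assumption here, not something taken from the course material), values more than 1.5 interquartile ranges outside the quartiles can be flagged for a closer look. The prices are hypothetical and `iqr_outliers` is an illustrative helper.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sale prices in thousands of dollars.
prices = [385, 399, 405, 410, 420, 455, 1250]

flagged = iqr_outliers(prices)
print("flagged for review:", flagged)

# Flagging is not deleting: the analyst still has to decide why the value
# is extraordinary before keeping or discarding it.
kept = [p for p in prices if p not in flagged]
print("mean with outlier:", round(statistics.mean(prices), 1))
print("mean without outlier:", round(statistics.mean(kept), 1))
```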

**RESIDUALS**

**Estimated or predicted values from a model will always differ from observed values. In real estate valuation, for example, the market value estimate will rarely be the same as the actual selling price; in a mass appraisal for assessment purposes, the assessed values on the whole will vary above and below their observed selling prices for properties that sold near the valuation. These differences are called residuals.**
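
To make the idea concrete, the sketch below fits a least-squares line to a few made-up sales and computes each residual as observed price minus predicted price. The data and the `fit_line` helper are hypothetical; a real valuation model would use many more variables.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical sales: floor area (square metres) and price (thousands).
areas = [100, 150, 200, 250, 300]
prices = [310, 400, 540, 610, 750]

slope, intercept = fit_line(areas, prices)
predicted = [intercept + slope * a for a in areas]

# Residual = observed selling price minus the model's estimate.
residuals = [p - est for p, est in zip(prices, predicted)]

for a, p, r in zip(areas, prices, residuals):
    print(f"{a} sq m: sold {p}, residual {r:+.1f}")
```

Some residuals are positive and some negative, matching the description of estimates falling above and below the observed selling prices.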

**NOMINAL VARIABLES**

**Nominal variables contain less information than ordinal variables - there is absolutely no relationship between any two, other than the fact that they are different. The values for the variables are those of convenience rather than for analysis, e.g., type of construction: wood frame, concrete, or steel. It is difficult to add these together or get a maximum or minimum.**

**INTERVAL VARIABLES**

***Interval variables* have a relationship between them, e.g., House A built in 2011 is five years newer than House B built in 2006. However, the relative "distance" between the two does not have a direct mathematical relationship or meaning. For example, is House A 2011 ÷ 2006 = 1.00249 times better than House B based on year built? Probably not.**

**BOXPLOTS MOST USEFUL**

**Boxplots are most useful for a continuous variable such as Price against a discrete variable where there are only a few possible values: e.g., bedrooms (number), presence of pool (yes/no), neighbourhood (number or letter code). The boxplot will very quickly show you whether the groups are equal or whether there are significant variations.**
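
A boxplot is drawn from a five-number summary of each group, so the comparison it makes can be sketched without any plotting library. The sales below are hypothetical, and `five_number_summary` is an illustrative helper built on the standard `statistics.quantiles` function.

```python
import statistics
from collections import defaultdict

# Hypothetical sales as (bedrooms, price in thousands).
sales = [
    (2, 310), (2, 335), (2, 298), (2, 350),
    (3, 420), (3, 455), (3, 390), (3, 470),
    (4, 610), (4, 580), (4, 655), (4, 720),
]

# Group the continuous variable (price) by the discrete one (bedrooms).
by_bedrooms = defaultdict(list)
for bedrooms, price in sales:
    by_bedrooms[bedrooms].append(price)


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


summaries = {b: five_number_summary(v) for b, v in sorted(by_bedrooms.items())}
for bedrooms, summary in summaries.items():
    print(bedrooms, "bedrooms:", summary)
```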

**Heteroskedasticity**

**The errors/residuals in the graph on the left increase from the beginning of the data to the end, with the variance getting wider with each successive year. This is a problem called *heteroskedasticity.* If the residual graph shows heteroskedasticity, then you should change the way you measure your variable. Usually the problem is that the errors get larger over time, partly because the size of the variable gets larger over time - for example, real estate prices tend to increase over time, so a forecast of future real estate prices will have errors that become more and more magnified. The way to fix this is to transform the variable so that it does not grow over time. Mathematically, a function called the natural logarithm is the ideal solution for any numbers that grow exponentially over time - you will need to re-express your data in a format that works.**
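
The effect of the log transform can be sketched with a synthetic series. The assumptions are loud ones: prices grow 5% a year, each year's error is plus or minus 10% of the trend, and the trend is treated as known rather than estimated. Under those assumptions the residual spread grows in dollars but stays constant in logs.

```python
import math
import statistics

# Synthetic series: 5% annual growth with errors of +/-10% of the trend.
years = range(20)
trend = [100 * 1.05 ** t for t in years]
shocks = [0.10 if t % 2 == 0 else -0.10 for t in years]
prices = [tr * (1 + s) for tr, s in zip(trend, shocks)]

# Residuals measured in price units and in natural-log units.
level_resid = [p - tr for p, tr in zip(prices, trend)]
log_resid = [math.log(p) - math.log(tr) for p, tr in zip(prices, trend)]


def spread_ratio(resid):
    """Residual spread in the later half divided by the earlier half."""
    half = len(resid) // 2
    return statistics.pstdev(resid[half:]) / statistics.pstdev(resid[:half])


print("price units:", round(spread_ratio(level_resid), 2))
print("log units:  ", round(spread_ratio(log_resid), 2))
```

A ratio well above 1 signals widening errors; a ratio near 1 after the transform suggests the re-expression has stabilized the variance in this toy example.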

**SMOOTH DATA PROPERTIES EXPLAINED BY . . .**

**The smooth properties in data are those explained by measures of central tendency, dispersion, and distribution.**

**ROUGH DATA ELEMENTS**

**ANALYSIS OF ROUGH DATA ELEMENTS**

**Rough elements are analyzed using features that showcase outliers, such as boxplots, scatterplots, and histograms.**

**BOXPLOTS MOST USEFUL**

**TRUE OR FALSE? Boxplots are not effective for analyzing two continuous variables.**

**ANSWER: TRUE**