Midterm Flashcards
VARIABLES
VARIABLE ASPECTS OF REALITY
(In statistical research, a variable is defined as an attribute of an object of study.)
- VARIABLES CONSIST OF
VALUES
σ
Sigma
Sigma represents
population standard deviation
Population standard deviation formula
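The standard formula for the population standard deviation, with µ the population mean, N the number of observations in the population, and x_i the individual values, is:

```latex
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
```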
µ means
mean
VARIABLE ASPECTS OF REALITY ARE CALLED
VARIABLES
VARIABLES CONSIST OF
VALUES
VALUES ARE TAKEN ON BY
OBSERVATIONS (SUBJECTS)
VALUES ARE TAKEN ON BY OBSERVATIONS (SUBJECTS) IN
TIME AND IN SPACE
WE MAY WANT TO DO TWO THINGS WITH VALUES OF OBSERVATIONS:
- a. WE MAY WANT TO KNOW IF THERE IS A PATTERN IN A LIMITED NUMBER OF VALUES AVAILABLE TO US “HERE AND NOW” (IS THERE A PATTERN OF SCORING BY A BASKETBALL TEAM OVER A SEASON?)
• THIS GOAL CAN BE ACCOMPLISHED WITH A SET OF STATISTICAL PROCEDURES CALLED DESCRIPTIVE STATISTICS.
- b. WE MAY ALSO WANT TO KNOW IF A PATTERN OBSERVED IN A LIMITED NUMBER OF OBSERVATIONS IS LIKELY TO HOLD WITH OTHER OBSERVATIONS UNDER SIMILAR CONDITIONS. (ARE OTHER TEAMS IN THE LEAGUE LIKELY TO DISPLAY A SIMILAR SCORING PATTERN OVER A SEASON AS THE TEAM WE HAVE OBSERVED?)
• THIS GOAL CAN BE ACCOMPLISHED WITH A SET OF STATISTICAL PROCEDURES KNOWN AS INFERENTIAL STATISTICS.
DESCRIPTIVE STATISTICS
is a means of describing the features of a data set by generating summaries of the sample data. It is often presented as a summary that explains what the data contain.
INFERENTIAL STATISTICS
describe the many ways in which statistics derived from observations on samples from study populations can be used to deduce whether or not those populations are truly different.
OBSERVATIONS THAT WE OBSERVE “HERE AND NOW” MAKE UP A
SAMPLE
TO DESCRIBE A SAMPLE WE USE
SAMPLE STATISTICS
SAMPLE STATISTICS ARE REFERRED TO BY
LATIN LETTERS
A SET OF ALL RELEVANT OBSERVATIONS FROM WHICH YOUR SAMPLE WAS TAKEN IS CALLED A
POPULATION
TO DESCRIBE A POPULATION, WE USE
POPULATION PARAMETERS
POPULATION PARAMETERS ARE REFERRED TO BY
GREEK LETTERS
THE PROBLEM WITH A POPULATION IS THAT IT’S DIFFICULT TO OBSERVE. THEREFORE, WE USUALLY OBSERVE PATTERNS IN SAMPLES AND DECIDE IF THESE PATTERNS ARE LIKELY TO
HOLD IN POPULATIONS
SAMPLES MUST BE
REPRESENTATIVE
SAMPLES MUST BE REPRESENTATIVE:
THEY MUST REFLECT GENERAL COMPOSITION OF POPULATION
SAMPLES MUST BE SELECTED VIA
RANDOM SAMPLING
SAMPLES MUST BE SELECTED VIA RANDOM SAMPLING:
WHERE EACH OBSERVATION IN A POPULATION HAS IDENTICAL PROBABILITY OF BEING SELECTED INTO A SAMPLE.
SAMPLING WITH/ WITHOUT REPLACEMENT
SAY WE HAVE 5 RED BALLS, 5 WHITE ONES, AND SELECT A SAMPLE OF 2 BALLS. PROBABILITY OF A 2ND BALL BEING RED DEPENDS ON THE COLOR OF THE 1ST BALL PICKED INTO THE SAMPLE. THIS VIOLATES EQUAL PROBABILITY PRINCIPLE FOR THE SECOND BALL. TO AVOID THE VIOLATION WE REPLACE THE 1ST BALL BEFORE PICKING THE 2ND ONE. REPLACEMENT IS NOT NECESSARY WITH LARGE POPULATIONS.
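The ball example above can be checked with exact fractions. This is a minimal sketch; the function name `p_second_red` and its parameters are made up for illustration and do not come from the cards.

```python
from fractions import Fraction

def p_second_red(red, white, first_is_red, replace):
    """Probability that the 2nd ball drawn is red, given the color of the 1st."""
    if not replace:
        # The first ball stays out of the urn, so the counts change.
        if first_is_red:
            red -= 1
        else:
            white -= 1
    return Fraction(red, red + white)

# Without replacement the answer depends on the first draw.
print(p_second_red(5, 5, first_is_red=True, replace=False))   # 4/9
print(p_second_red(5, 5, first_is_red=False, replace=False))  # 5/9
# With replacement it is the same either way.
print(p_second_red(5, 5, first_is_red=True, replace=True))    # 1/2
print(p_second_red(5, 5, first_is_red=False, replace=True))   # 1/2
# In a large population the first draw barely matters, even without replacement.
print(float(p_second_red(5000, 5000, first_is_red=True, replace=False)))
```

The last line illustrates why replacement is not necessary with large populations: removing one ball out of 10,000 changes the probability only slightly.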
A PERFECT FIT BETWEEN A SAMPLE AND A POPULATION DOES NOT EXIST. THERE’S ALWAYS A
SAMPLING ERROR
A SAMPLING ERROR IS
THE DIFFERENCE BETWEEN A POPULATION PARAMETER AND A SAMPLE STATISTIC
TWO TYPES OF SAMPLING ERRORS:
• THE RELATIVELY “HARMLESS” SAMPLING ERROR IS UNBIASED
• THE “HARMFUL” SAMPLING ERROR IS BIASED
THE RELATIVELY “HARMLESS” SAMPLING ERROR IS UNBIASED:
OVER MULTIPLE SAMPLES SOME SAMPLE STATISTICS WILL BE GREATER AND SOME – SMALLER THAN POPULATION PARAMETER.
THE “HARMFUL” SAMPLING ERROR IS BIASED:
OVER MULTIPLE SAMPLES, THE SAMPLE STATISTICS WILL ALL FALL ON THE SAME SIDE OF THE POPULATION PARAMETER: ALL GREATER OR ALL SMALLER.
THE BIAS OF A SAMPLING ERROR CAN BE DETECTED BY
INVESTIGATING THE SAMPLING PROCEDURE (BECAUSE FULL POPULATIONS AND THEIR PARAMETERS ARE USUALLY UNOBSERVABLE). FOR A SAMPLING ERROR TO BE UNBIASED, THE SAMPLING PROCEDURE MUST ENSURE EQUAL PROBABILITY OF SELECTION FOR EACH OBSERVATION IN THE POPULATION.
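The difference between the two kinds of sampling error can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population (the values 1 to 100), the sample size, and the flawed procedure that can only select values above 50 are all hypothetical.

```python
import random
from statistics import mean

rng = random.Random(0)            # fixed seed so the run is repeatable
population = list(range(1, 101))  # hypothetical population: the values 1..100
mu = mean(population)             # population parameter (50.5)

def sampling_errors(pool, n_samples=200, n=10):
    """Sample mean minus population mean, for many samples drawn from `pool`."""
    return [mean(rng.sample(pool, n)) - mu for _ in range(n_samples)]

# Random sampling: every observation has the same chance of selection.
unbiased = sampling_errors(population)
# Flawed procedure: only observations above 50 can ever be selected.
biased = sampling_errors([x for x in population if x > 50])

print(min(unbiased) < 0 < max(unbiased))  # do errors fall on both sides?
print(min(biased) > 0)                    # do errors all fall on one side?
```

In practice the population mean is unknown, which is why the cards say bias is detected by inspecting the sampling procedure, not by comparing against the parameter as this sketch does.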
TYPES OF VARIABLES
NATURE VARIABLES
DISCRETE VARIABLES
CONTINUOUS VARIABLES
NATURE VARIABLES CAN BE
DISCRETE OR CONTINUOUS
DISCRETE VARIABLES HAVE MEASUREMENT UNITS THAT ARE
CLEARLY DEFINED WITH NO INTERIM VALUES FALLING BETWEEN TWO SMALLEST POSSIBLE UNITS
DISCRETE VALUES ARE OFTEN USED TO
DENOTE QUALITIES (FEMALE / MALE 1 / 2)
DISCRETE VARIABLES USUALLY HAVE
RELATIVELY FEW VALUES (TYPES OF A MEDAL: GOLD, SILVER, BRONZE), BUT SOME CAN HAVE A LARGER NUMBER OF VALUES (THE NUMBER OF ONE-CENT COINS IN YOUR POCKET).
CONTINUOUS VARIABLES DO NOT HAVE A
CLEARLY DEFINED SMALLEST VALUE. (TIME, TEMPERATURE, ETC.) VALUES COULD IN PRINCIPLE CONTINUE TO INFINITY IN BETWEEN ANY TWO GIVEN OBSERVATIONS.
CONTINUOUS VARIABLES NEVER HAVE
THE SAME VALUE FOR ANY TWO OBSERVATIONS. NO TWO PEOPLE ARE EXACTLY 170 CM TALL. WE ONLY HAVE SAME-SOUNDING VALUES, BECAUSE OUR MEASUREMENT DEVICES CANNOT PICK UP FINER SUB-UNITS.
TO GIVE A VALUE OF A CONTINUOUS VARIABLE PRECISELY, YOU SHOULD
INDICATE ITS UPPER AND LOWER REAL LIMITS AT A DESIRED INTERVAL. LET’S SAY THAT A DESIRED INTERVAL IS 1 CM. A PERSON WITH A HEIGHT OF 170 CM IS THEN SAID TO BE BETWEEN LRL = 169.5 CM & URL = 170.5 CM
BY DEFINITION 170.5 IS THE URL OF AN INTERVAL
“170”
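Real limits are the reported value plus and minus half the measurement interval. A minimal sketch, where the helper name `real_limits` is made up for illustration:

```python
def real_limits(value, interval=1.0):
    """Return (LRL, URL): the value minus and plus half the measurement interval."""
    half = interval / 2
    return value - half, value + half

# Height of 170 cm measured to the nearest 1 cm.
print(real_limits(170))  # (169.5, 170.5)
```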
SAME VARIABLE CAN BE MEASURED WITH DIFFERENT DEGREE OF PRECISION WITH DISTINCT
MEASUREMENT SCALES
NOMINAL SCALE (NAMES) VALUES SIMPLY PERFORM
THE FUNCTION OF NAMES. NO MATHEMATICAL OPERATIONS CAN BE ACCOMPLISHED WITH NOMINAL SCALE. (NUMBERS ON BASKETBALL JERSEYS, RANDOMLY ASSIGNED TO PLAYERS)
ORDINAL SCALE (RANKINGS). VALUES CAN BE USED TO
RANK OBSERVATIONS IN ORDER OF MAGNITUDE. (NUMBERS ON BASKETBALL JERSEYS, ASSIGNED ACCORDING TO HEIGHT). NO MATHEMATICAL OPERATIONS, EXCEPT FOR RANKING, CAN BE CONDUCTED ON THIS SCALE.
NOMINAL AND ORDINAL SCALES CAN BE USED TO
MEASURE BOTH DISCRETE AND CONTINUOUS VARIABLES (WHILE BLOOD PRESSURE IS A CONTINUOUS VARIABLE, ITS MEASURE: SYSTOLIC OR DIASTOLIC IS A NOMINAL SCALE MEASURE, WHILE ANOTHER MEASURE: LOW, MEDIUM AND HIGH PRESSURE IS AN ORDINAL MEASURE).
STILL YOU SHOULD AVOID MEASURING CONTINUOUS VARIABLES ON NOMINAL OR ORDINAL SCALE, BECAUSE
THIS WAY YOU LOSE PRECISION THAT CAN BE OBTAINED WITH MORE SOPHISTICATED SCALES.
INTERVAL SCALE. ENABLES NOT ONLY RANKING, BUT
MEASURING MEANINGFUL DIFFERENCES BETWEEN VALUES OF A VARIABLE.
INTERVAL SCALE MEASUREMENTS DO NOT HAVE
A TRUE ZERO (SOMETIMES KNOWN AS AN ABSOLUTE ZERO), AT WHICH A VARIABLE CEASES TO EXIST.
ALL MATHEMATICAL OPERATIONS CAN BE DONE WITH
VARIABLES MEASURED ON AN INTERVAL SCALE, EXCEPT FOR TAKING A RATIO (CANNOT DIVIDE). CONSIDER WAKING UP AT 4 AM WHILE YOU NORMALLY WAKE UP AT 8 AM. DOES THAT MEAN THAT YOU WOKE UP TWICE AS EARLY? NO (BECAUSE TIME DID NOT START AT MIDNIGHT).
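One way to see why ratios fail on an interval scale is to move the zero point and watch what changes. In this sketch the alternative origin (6 PM the previous day) is an arbitrary assumption chosen only for illustration.

```python
# Hours measured from midnight (the usual clock origin).
early, usual = 4, 8
# The same two moments measured from a hypothetical origin of 6 PM the day before.
shift = 6
early_shifted, usual_shifted = early + shift, usual + shift

# Differences survive a change of origin...
print(usual - early, usual_shifted - early_shifted)  # 4 4
# ...but ratios do not, so "twice as early" has no fixed meaning.
print(early / usual, early_shifted / usual_shifted)
```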
RATIO SCALE. APPLIES TO
CONTINUOUS VARIABLES THAT HAVE AN ABSOLUTE ZERO. ALL MATHEMATICAL OPERATIONS POSSIBLE.
INTERVAL AND RATIO SCALES USUALLY MEASURE
CONTINUOUS VARIABLES. YOU SHOULD USE THESE TWO SCALES TO MEASURE CONTINUOUS VARIABLES, INSTEAD OF USING NOMINAL OR ORDINAL SCALES, FOR THE RICHNESS OF INFORMATION.
VARIABLES ARE USUALLY REFERRED TO WITH
LATIN UPPER-CASE LETTERS (X, Y, Q…)
VALUES ARE USUALLY REFERRED TO WITH
LATIN LOWER-CASE LETTERS WITH SUBSCRIPTS (x1, x2, x3 … xn).
THE NUMBER OF OBSERVATIONS IN A POPULATION IS MARKED WITH
UPPER CASE N
A SUM OF VALUES OF A PARTICULAR VARIABLE IS KNOWN BY
UPPER CASE GREEK LETTER SIGMA: Σ. SIGMA MUST ALWAYS BE FOLLOWED BY WHATEVER IS BEING ADDED.
a. LET’S SAY WE HAVE A SAMPLE OF n = 4 VALUES OF X: 4, 3, 6, 7.
• Σ(X) = 20
• Σ(X – 1)² = 9 + 4 + 25 + 36 = 74
• (ΣX)² = 20² = 400.
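The three sums above differ only in the order of operations, which a few lines of code make explicit. The variable names are made up for illustration.

```python
X = [4, 3, 6, 7]

sum_x = sum(X)                             # ΣX: add the values
sum_sq_dev = sum((x - 1) ** 2 for x in X)  # Σ(X - 1)²: subtract, square, then add
sq_sum = sum(X) ** 2                       # (ΣX)²: add first, then square

print(sum_x, sum_sq_dev, sq_sum)  # 20 74 400
```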
FREQUENCY DISTRIBUTIONS
THE FIRST TOOL FOR DESCRIPTIVE STATISTICS
FD SHOW
WHICH VALUES IN A VARIABLE OCCUR FREQUENTLY, AND WHICH ARE RARE
USUALLY, FD ARE
GRAPHIC REPRESENTATIONS OF DATA, BUT THEY BEGIN WITH A FREQUENCY TABLE.
FREQUENCY TABLES LIST VALUES OF A VARIABLE IN THE
LEFTMOST COLUMN. ALL POSSIBLE VALUES MUST BE LISTED.
AN ADJACENT COLUMN CONTAINS
FREQUENCIES (f) OF EACH VALUE: NUMBERS OF OBSERVATIONS IN A SAMPLE THAT HAVE A PARTICULAR VALUE
A FREQUENCY TABLE MAY CONTAIN
RELATIVE FREQUENCIES (rf, %): SHARES OF OBSERVATIONS (FROM THE TOTAL n) THAT HAVE A PARTICULAR VALUE.
A FREQUENCY TABLE MAY CONTAIN CUMULATIVE FREQUENCIES (cf):
NUMBERS OF OBSERVATIONS THAT HAVE VALUES THAT ARE EQUAL TO OR LOWER THAN A GIVEN VALUE.
A FREQUENCY TABLE MAY CONTAIN CUMULATIVE RELATIVE FREQUENCIES (crf, c%):
SHARES OF OBSERVATIONS THAT HAVE VALUES THAT ARE EQUAL TO OR LOWER THAN THE VALUE.
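The four columns described above (f, rf, cf, crf) can be built in one pass over the sorted values. This is a minimal sketch; the sample `scores` is hypothetical.

```python
from collections import Counter

scores = [2, 3, 3, 1, 4, 3, 2, 5, 3, 2]  # hypothetical sample, n = 10
n = len(scores)
counts = Counter(scores)

table = []  # rows of (value, f, rf, cf, crf)
cf = 0
# List every possible value in the range, even one with f = 0.
for value in range(min(scores), max(scores) + 1):
    f = counts[value]
    cf += f
    table.append((value, f, f / n, cf, cf / n))

print("value  f    rf  cf   crf")
for value, f, rf, cum_f, crf in table:
    print(f"{value:>5} {f:>2} {rf:>5.2f} {cum_f:>3} {crf:>5.2f}")
```

The last row always has cf equal to n and crf equal to 1, which is a quick check that no observation was dropped.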
CUMULATIVE RELATIVE FREQUENCY IS USEFUL FOR
SHOWING A RELATIVE STANDING OF AN OBSERVATION WITH A PARTICULAR VALUE VIS-À-VIS OTHER OBSERVATIONS.
THE CONCEPT OF PERCENTILE RANK.
PERCENTILE RANK SHOWS RELATIVE STANDING OF AN OBSERVATION’S VALUE AMONG OTHER VALUES.
PERCENTILE RANK SHOWS
THE PERCENT OF OBSERVATIONS WITH VALUES EQUAL TO OR LOWER THAN A GIVEN VALUE. A STUDENT EARNING A GRADE WITH A PERCENTILE RANK OF 70 HAS DONE AS WELL AS OR BETTER THAN 70% OF STUDENTS.
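Following the definition above, a percentile rank is a count of values at or below the given value, expressed as a percent. A minimal sketch with hypothetical grades; the function name `percentile_rank` is made up for illustration.

```python
def percentile_rank(values, x):
    """Percent of observations with a value equal to or lower than x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)

grades = [55, 60, 65, 70, 72, 78, 80, 85, 90, 95]  # hypothetical grades
print(percentile_rank(grades, 80))  # 70.0
```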
FOR CONTINUOUS VARIABLES, OR DISCRETE ONES WITH MANY POSSIBLE VALUES, THE CONTENT OF THE LEFT COLUMN IN F.T. HAS TO BE
CLUSTERED INTO GROUPS OF EQUAL SIZE, WITH APPROXIMATELY 8 – 10 SUCH GROUPS.
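Grouping can be sketched by mapping each value to the lower bound of its interval. The heights and the interval width of 5 cm below are hypothetical, chosen so that the data fall into 9 groups.

```python
from collections import Counter

heights = [152, 158, 161, 163, 165, 167, 168, 170, 171, 173,
           174, 176, 178, 181, 183, 186, 189, 191]  # hypothetical data, cm
width = 5                                           # assumed interval width

# Map each value to the lower bound of its interval, e.g. 163 -> 160.
groups = Counter((h // width) * width for h in heights)

# Print every interval in order, including any that happen to be empty.
for low in range(min(groups), max(groups) + width, width):
    print(f"{low}-{low + width - 1}: {groups[low]}")
```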
INTERPOLATION:
MAKING AN EDUCATED GUESS ABOUT THE LIKELY CRF OF A VALUE IN THE MIDDLE OF AN INTERVAL.
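One common way to make that educated guess is linear interpolation between the real limits of the interval, which assumes observations are spread evenly within it. That assumption, the function name `interpolate_crf`, and the numbers below are illustrative and not taken from the cards.

```python
def interpolate_crf(x, lrl, url, crf_at_lrl, crf_at_url):
    """Linear interpolation of the cumulative relative frequency inside an interval."""
    fraction = (x - lrl) / (url - lrl)  # how far x sits into the interval
    return crf_at_lrl + fraction * (crf_at_url - crf_at_lrl)

# Hypothetical interval 170-174 cm with real limits 169.5 and 174.5;
# 40% of observations fall below 169.5 and 60% fall below 174.5.
print(interpolate_crf(172.0, 169.5, 174.5, 40, 60))  # 50.0
```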
WHAT IF YOU HAVE A CONTINUOUS VARIABLE?
• USE A POLYGON
• USE A HISTOGRAM (SHOWN FOR A SEPARATE SET OF VALUES)
FOR STARTERS: STEM AND LEAF DIAGRAM
AN ALTERNATIVE WAY OF VISUALIZING F.D. OF CONTINUOUS VARIABLES (JOHN TUKEY).
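A stem-and-leaf diagram splits each value into a stem (here the tens digit) and a leaf (the units digit). A minimal sketch with hypothetical two-digit scores; the function name `stem_and_leaf` is made up for illustration, and stems with no leaves are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} using tens as stems and units as leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

scores = [62, 75, 71, 88, 93, 67, 75, 84, 79, 90]  # hypothetical scores
for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Read sideways, the rows of leaves form a rough histogram while still showing every original value.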
Just understand this table:
FREQUENCY DISTRIBUTIONS CAN HAVE A GREAT VARIETY OF SHAPES. LET’S LEARN SOME WORDS TO DESCRIBE THEM.
THE KEY POINT OF DEPARTURE IN TALKING ABOUT SHAPES IS THE CONCEPT OF
SYMMETRY