Section 3.1 - Journals, design, analysis, bioinformatics, t test, & linear regression Flashcards

1
Q

Data for statistical studies are obtained by conducting either experiments or surveys. Experimental design is the branch of statistics that deals with the design and analysis of experiments.

A

The methods of experimental design are used in agriculture, medicine, biology, marketing research and industrial production.

2
Q

Consider an experiment designed to determine the effect of three different exercise programs on the cholesterol level of patients with elevated cholesterol. Each patient is referred to as an experimental unit, the response variable is the cholesterol level of the patient at the completion of the program, and the exercise program is the

A

factor whose effect on cholesterol level is being investigated. Each of the three exercise programs is referred to as a treatment.

3
Q

Three of the more widely used experimental designs are the

A

completely randomized design, the randomized block design, and the factorial design. In a completely randomized experimental design, the treatments are randomly assigned to the experimental units. For instance, applying this design method to the cholesterol-level study, the three types of exercise program (treatment) would be randomly assigned to the experimental units (patients).
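A minimal sketch of this random assignment in R (the patient labels, group sizes, and seed are illustrative, not from an actual study):

# Completely randomized design: randomly assign three exercise
# programs (treatments) to 15 hypothetical patients (experimental units).
set.seed(42)                                  # illustrative seed
patients <- paste0("patient_", 1:15)
programs <- rep(c("A", "B", "C"), each = 5)   # 5 patients per program
design <- data.frame(patient = patients, program = sample(programs))
head(design)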

4
Q

The use of a completely randomized design will yield less precise results when factors not accounted for by the experimenter affect the response variable.

A

Consider an experiment designed to study the effect of two different gasoline additives on the fuel efficiency, measured in miles per gallon (mpg), of full-size automobiles produced by three manufacturers. Suppose that 30 automobiles, 10 from each manufacturer, were available for the experiment. In a completely randomized design the two gasoline additives (treatments) would be randomly assigned to the 30 automobiles, with each additive being assigned to 15 different cars.

Suppose that manufacturer 1 has developed an engine that gives its full-size cars a higher fuel efficiency than those produced by manufacturers 2 and 3. A completely randomized design could, by chance, assign gasoline additive 1 to a larger proportion of cars from manufacturer 1. In such a case, gasoline additive 1 might be judged to be more fuel efficient when in fact the difference observed is actually due to the better engine design of automobiles produced by manufacturer 1.

To prevent this from occurring, a statistician could design an experiment in which both gasoline additives are tested using five cars produced by each manufacturer; in this way, any effects due to the manufacturer would not affect the test for significant differences due to gasoline additive. In this revised experiment, each of the manufacturers is referred to as a block, and the experiment is called a randomized block design. In general, blocking is used in order to enable comparisons among the treatments to be made within blocks of homogeneous experimental units.
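A sketch of this blocking scheme in R, assuming the 30 cars described above (the seed and labels are illustrative):

# Randomized block design: within each manufacturer (block), randomly
# assign additive 1 to five cars and additive 2 to the other five.
set.seed(7)
design <- do.call(rbind, lapply(1:3, function(m) {
  data.frame(manufacturer = m, car = 1:10,
             additive = sample(rep(c("additive_1", "additive_2"), each = 5)))
}))
table(design$manufacturer, design$additive)   # five cars in every cell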

5
Q

Factorial experiments are designed to draw conclusions about more than one factor, or variable.

A

The term factorial is used to indicate that all possible combinations of the factors are considered. For instance, if there are two factors with a levels for factor 1 and b levels for factor 2, the experiment will involve collecting data on ab treatment combinations. The factorial design can be extended to experiments involving more than two factors and experiments involving partial factorial designs.
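For example, with hypothetical factors having a = 3 and b = 2 levels, the ab = 6 treatment combinations can be enumerated in R:

# Full factorial: every combination of the factor levels.
combos <- expand.grid(factor1 = c("low", "medium", "high"),
                      factor2 = c("level_1", "level_2"))
nrow(combos)   # ab = 3 x 2 = 6 treatment combinations
combos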

6
Q

Analysis of variance and significance testing

A

A computational procedure frequently used to analyze the data from an experimental study employs a statistical procedure known as the analysis of variance. For a single-factor experiment, this procedure uses a hypothesis test concerning equality of treatment means to determine if the factor has a statistically significant effect on the response variable. For experimental designs involving multiple factors, a test for the significance of each individual factor as well as interaction effects caused by one or more factors acting jointly can be made. Further discussion of the analysis of variance procedure is contained in the subsequent section.

7
Q

Regression and correlation analysis

A

Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables. A model of the relationship is hypothesized, and estimates of the parameter values are used to develop an estimated regression equation. Various tests are then employed to determine if the model is satisfactory. If the model is deemed satisfactory, the estimated regression equation can be used to predict the value of the dependent variable given values for the independent variables.

8
Q

Regression model

A

In simple linear regression, the model used to describe the relationship between a single dependent variable y and a single independent variable x is y = β0 + β1x + ε. β0 and β1 are referred to as the model parameters, and ε is a probabilistic error term that accounts for the variability in y that cannot be explained by the linear relationship with x. If the error term were not present, the model would be deterministic; in that case, knowledge of the value of x would be sufficient to determine the value of y.

In multiple regression analysis, the model for simple linear regression is extended to account for the relationship between the dependent variable y and p independent variables x1, x2, . . ., xp. The general form of the multiple regression model is y = β0 + β1x1 + β2x2 + . . . + βpxp + ε. The parameters of the model are the β0, β1, . . ., βp, and ε is the error term.

9
Q

Least squares method

A

Either a simple or multiple regression model is initially posed as a hypothesis concerning the relationship among the dependent and independent variables. The least squares method is the most widely used procedure for developing estimates of the model parameters. For simple linear regression, the least squares estimates of the model parameters β0 and β1 are denoted b0 and b1. Using these estimates, an estimated regression equation is constructed: ŷ = b0 + b1x. The graph of the estimated regression equation for simple linear regression is a straight line approximation to the relationship between y and x.
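A minimal sketch of the least squares fit in R, using simulated data (the true intercept and slope below are invented for illustration):

set.seed(1)
x <- runif(20, 0, 100)                  # simulated independent variable
y <- 40 + 0.5 * x + rnorm(20, sd = 10)  # simulated dependent variable
fit <- lm(y ~ x)                        # lm() computes the least squares estimates
coef(fit)                               # b0 (intercept) and b1 (slope)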

10
Q

As an illustration of regression analysis and the least squares method, suppose a university medical centre is investigating the relationship between stress and blood pressure. Assume that both a stress test score and a blood pressure reading have been recorded for a sample of 20 patients.

A

Values of the independent variable, stress test score, are given on the horizontal axis, and values of the dependent variable, blood pressure, are shown on the vertical axis. The line passing through the data points is the graph of the estimated regression equation: ŷ = 42.3 + 0.49x. The parameter estimates, b0 = 42.3 and b1 = 0.49, were obtained using the least squares method.
A primary use of the estimated regression equation is to predict the value of the dependent variable when values for the independent variables are given. For instance, given a patient with a stress test score of 60, the predicted blood pressure is 42.3 + 0.49(60) = 71.7. The values predicted by the estimated regression equation are the points on the line in Figure 4, and the actual blood pressure readings are represented by the points scattered about the line. The difference between the observed value of y and the value of y predicted by the estimated regression equation is called a residual. The least squares method chooses the parameter estimates such that the sum of the squared residuals is minimized.
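The prediction and residual arithmetic can be verified directly in R (the observed reading of 75 below is hypothetical):

b0 <- 42.3; b1 <- 0.49   # estimates from the example above
b0 + b1 * 60             # predicted blood pressure for a stress score of 60: 71.7
75 - (b0 + b1 * 60)      # residual for a hypothetical observed reading of 75: 3.3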

11
Q

Analysis of variance and goodness of fit

A

A commonly used measure of the goodness of fit provided by the estimated regression equation is the coefficient of determination. Computation of this coefficient is based on the analysis of variance procedure that partitions the total variation in the dependent variable, denoted SST, into two parts: the part explained by the estimated regression equation, denoted SSR, and the part that remains unexplained, denoted SSE.

12
Q

The measure of total variation, SST, is the sum of the squared deviations of the dependent variable about its mean: Σ(y − ȳ)². This quantity is known as the total sum of squares. The measure of unexplained variation, SSE, is referred to as the residual sum of squares. For the data in Figure 4, SSE is the sum of the squared distances from each point in the scatter diagram to the estimated regression line: Σ(y − ŷ)². SSE is also commonly referred to as the error sum of squares. A key result in the analysis of variance is that SSR + SSE = SST.

A

The ratio r² = SSR/SST is called the coefficient of determination. If the data points are clustered closely about the estimated regression line, the value of SSE will be small and SSR/SST will be close to 1. Using r², whose values lie between 0 and 1, provides a measure of goodness of fit; values closer to 1 imply a better fit. A value of r² = 0 implies that there is no linear relationship between the dependent and independent variables.
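These quantities can be computed directly from a fitted model in R, again using illustrative simulated data:

set.seed(1)
x <- runif(20, 0, 100)
y <- 40 + 0.5 * x + rnorm(20, sd = 10)
fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)    # total sum of squares
SSE <- sum(residuals(fit)^2)   # residual (error) sum of squares
SSR <- SST - SSE               # explained sum of squares
SSR / SST                      # coefficient of determination, r^2
summary(fit)$r.squared         # agrees with the ratio above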

13
Q

When expressed as a percentage, the coefficient of determination can be interpreted as the percentage of the total sum of squares that can be explained using the estimated regression equation. For the stress-level research study, the value of r² is 0.583; thus, 58.3% of the total sum of squares can be explained by the estimated regression equation ŷ = 42.3 + 0.49x.

A

For typical data found in the social sciences, values of r² as low as 0.25 are often considered useful. For data in the physical sciences, r² values of 0.60 or greater are frequently found.

14
Q

Significance Testing

A

In a regression study, hypothesis tests are usually conducted to assess the statistical significance of the overall relationship represented by the regression model and to test for the statistical significance of the individual parameters. The statistical tests used are based on the following assumptions concerning the error term: (1) ε is a random variable with an expected value of 0, (2) the variance of ε is the same for all values of x, (3) the values of ε are independent, and (4) ε is a normally distributed random variable.

15
Q

The mean square due to regression, denoted MSR, is computed by dividing SSR by a number referred to as its degrees of freedom; in a similar manner, the mean square due to error, MSE, is computed by dividing SSE by its degrees of freedom. An F-test based on the ratio MSR/MSE can be used to test the statistical significance of the overall relationship between the dependent variable and the set of independent variables.

A

large values of F = MSR/MSE support the conclusion that the overall relationship is statistically significant. If the overall model is deemed statistically significant, statisticians will usually conduct hypothesis tests on the individual parameters to determine if each independent variable makes a significant contribution to the model.
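In R, anova() applied to a fitted model reports the sums of squares, mean squares, and this F ratio (continuing the illustrative simulated fit):

set.seed(1)
x <- runif(20, 0, 100)
y <- 40 + 0.5 * x + rnorm(20, sd = 10)
fit <- lm(y ~ x)
anova(fit)   # SSR and SSE with their degrees of freedom, MSR, MSE, F, and the p-value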

16
Q

Residual Analysis

A

The analysis of residuals plays an important role in validating the regression model. If the error term in the regression model satisfies the four assumptions noted earlier, then the model is considered valid. Since the statistical tests for significance are also based on these assumptions, the conclusions resulting from these significance tests are called into question if the assumptions regarding ε are not satisfied.

The ith residual is the difference between the observed value of the dependent variable, yi, and the value predicted by the estimated regression equation, ŷi. These residuals, computed from the available data, are treated as estimates of the model error, ε. As such, they are used by statisticians to validate the assumptions concerning ε. Good judgment and experience play key roles in residual analysis.

17
Q

Graphical plots and statistical tests concerning the residuals are examined carefully by statisticians, and judgments are made based on these examinations. The most common residual plot shows ŷ on the horizontal axis and the residuals on the vertical axis.

A

If the assumptions regarding the error term, ε, are satisfied, the residual plot will consist of a horizontal band of points. If the residual analysis does not indicate that the model assumptions are satisfied, it often suggests ways in which the model can be modified to obtain better results.
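A sketch of such a residual plot in R, using the built-in cars data set rather than the stress example:

fit <- lm(dist ~ speed, data = cars)   # built-in example data
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values (y-hat)", ylab = "Residuals")
abline(h = 0, lty = 2)   # a patternless horizontal band supports the assumptions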

18
Q

Model building

A

In regression analysis, model building is the process of developing a probabilistic model that best describes the relationship between the dependent and independent variables. The major issues are finding the proper form (linear or curvilinear) of the relationship and selecting which independent variables to include. In building models it is often desirable to use qualitative as well as quantitative variables.

19
Q

quantitative variables measure how much or how many; qualitative variables represent types or categories. For instance, suppose it is of interest to predict sales of an iced tea that is available in either bottles or cans. Clearly, the independent variable “container type” could influence the dependent variable “sales.” Container type is a qualitative variable, however, and must be assigned numerical values if it is to be used in a regression study. So-called dummy variables are used to represent qualitative variables in regression analysis. For example, the dummy variable x could be used to represent container type by setting x = 0 if the iced tea is packaged in a bottle and x = 1 if the iced tea is in a can. If the beverage could be placed in glass bottles, plastic bottles, or cans, it would require two dummy variables to properly represent the qualitative variable container type. In general, k - 1 dummy variables are needed to model the effect of a qualitative variable that may assume k values.

A

The general linear model y = β0 + β1x1 + β2x2 + . . . + βpxp + ε can be used to model a wide variety of curvilinear relationships between dependent and independent variables. For instance, each of the independent variables could be a nonlinear function of other variables. Also, statisticians sometimes find it necessary to transform the dependent variable in order to build a satisfactory model. A logarithmic transformation is one of the more common types.
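Returning to the dummy-variable discussion above, the k − 1 rule can be checked in R: for a hypothetical container-type factor with k = 3 levels, model.matrix() generates two dummy columns, with one level serving as the reference:

container <- factor(c("glass_bottle", "plastic_bottle", "can"))
model.matrix(~ container)   # intercept plus k - 1 = 2 dummy variables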

20
Q

Correlation

A

Correlation and regression analysis are related in the sense that both deal with relationships among variables. The correlation coefficient is a measure of linear association between two variables. Values of the correlation coefficient are always between −1 and +1. A correlation coefficient of +1 indicates that two variables are perfectly related in a positive linear sense, a correlation coefficient of −1 indicates that two variables are perfectly related in a negative linear sense, and a correlation coefficient of 0 indicates that there is no linear relationship between the two variables. For simple linear regression, the sample correlation coefficient is the square root of the coefficient of determination, with the sign of the correlation coefficient being the same as the sign of b1, the coefficient of x1 in the estimated regression equation.

Neither regression nor correlation analyses can be interpreted as establishing cause-and-effect relationships. They can indicate only how or to what extent variables are associated with each other. The correlation coefficient measures only the degree of linear association between two variables. Any conclusions about a cause-and-effect relationship must be based on the judgment of the analyst.
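The link between the sample correlation coefficient and the coefficient of determination can be verified in R with the built-in cars data:

r <- cor(cars$speed, cars$dist)   # sample correlation coefficient
fit <- lm(dist ~ speed, data = cars)
c(r = r, r_squared = r^2)         # r^2 equals the coefficient of determination
summary(fit)$r.squared            # same value from the regression fit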

21
Q

Time series and forecasting

A

A time series is a set of data collected at successive points in time or over successive periods of time. A sequence of monthly data on new housing starts and a sequence of weekly data on product sales are examples of time series. Usually the data in a time series are collected at equally spaced periods of time, such as hour, day, week, month, or year.

22
Q

A primary concern of time series analysis is the development of forecasts for future values of the series.

A

For instance, the federal government develops forecasts of many economic time series such as the gross domestic product, exports, and so on. Most companies develop forecasts of product sales.

23
Q

While in practice both qualitative and quantitative forecasting methods are utilized, statistical approaches to forecasting employ quantitative methods. The two most widely used methods of forecasting are the Box-Jenkins autoregressive integrated moving average (ARIMA) and econometric models.

A

ARIMA methods are based on the assumption that a probability model generates the time series data. Future values of the time series are assumed to be related to past values as well as to past errors. A time series must be stationary (i.e., have a constant mean, variance, and autocorrelation function) in order for an ARIMA model to be applicable. For nonstationary series, sometimes differences between successive values can be taken and used as a stationary series to which the ARIMA model can be applied.
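A minimal sketch in R using the built-in AirPassengers series; the ARIMA(1,1,1) order is purely illustrative, not a recommended model for these data:

series <- log(AirPassengers)               # log transform stabilizes the variance
plot(diff(series))                         # first differences look roughly stationary
fit <- arima(series, order = c(1, 1, 1))   # ARIMA(1,1,1): differencing built in
predict(fit, n.ahead = 12)$pred            # forecasts for the next 12 months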

24
Q

Econometric models

A

develop forecasts of a time series using one or more related time series and possibly past values of the time series. This approach involves developing a regression model in which the time series is forecast as the dependent variable; the related time series as well as the past values of the time series are the independent or predictor variables.

25
Q

Nonparametric methods

A

The statistical methods discussed above generally focus on the parameters of populations or probability distributions and are referred to as parametric methods. Nonparametric methods are statistical methods that require fewer assumptions about a population or probability distribution and are applicable in a wider range of situations. For a statistical method to be classified as a nonparametric method, it must satisfy one of the following conditions: (1) the method is used with qualitative data, or (2) the method is used with quantitative data when no assumption can be made about the population probability distribution. In cases where both parametric and nonparametric methods are applicable, statisticians usually recommend using parametric methods because they tend to provide better precision. Nonparametric methods are useful, however, in situations where the assumptions required by parametric methods appear questionable. A few of the more commonly used nonparametric methods are described below.
Assume that individuals in a sample are asked to state a preference for one of two similar and competing products. A plus (+) sign can be recorded if an individual prefers one product and a minus (−) sign if the individual prefers the other product. With qualitative data in this form, the nonparametric sign test can be used to statistically determine whether a difference in preference for the two products exists for the population. The sign test also can be used to test hypotheses about the value of a population median.
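Under the null hypothesis of no preference, a plus sign is as likely as a minus sign, so the sign test reduces to a binomial test; the counts below are hypothetical:

plus <- 18; minus <- 7                    # hypothetical preference counts
binom.test(plus, plus + minus, p = 0.5)   # two-sided sign test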

26
Q

The Wilcoxon signed-rank test can be used to test hypotheses about two populations. In collecting data for this test, each element or experimental unit in the sample must generate two paired or matched data values, one from population 1 and one from population 2. Differences between the paired or matched data values are used to test for a difference between the two populations. The Wilcoxon signed-rank test is applicable when no assumption can be made about the form of the probability distributions for the populations.

A

Another nonparametric test for detecting differences between two populations is the Mann-Whitney-Wilcoxon test. This method is based on data from two independent random samples, one from population 1 and another from population 2. There is no matching or pairing as required for the Wilcoxon signed-rank test.
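Both tests are available in R through wilcox.test(); the data below are simulated for illustration:

set.seed(3)
before <- rnorm(12, 100, 10)
after  <- before - rnorm(12, 3, 4)
wilcox.test(before, after, paired = TRUE)   # Wilcoxon signed-rank (matched pairs)

g1 <- rnorm(15, 100, 10)
g2 <- rnorm(15, 106, 10)
wilcox.test(g1, g2)                         # Mann-Whitney-Wilcoxon (independent samples)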

27
Q

Nonparametric methods for correlation analysis are also available. The Spearman rank correlation coefficient is a measure of the relationship between two variables when data in the form of rank orders are available. For instance, the Spearman rank correlation coefficient could be used to determine the degree of agreement between men and women concerning their preference ranking of 10 different television shows.

A

A Spearman rank correlation coefficient of 1 would indicate complete agreement, a coefficient of −1 would indicate complete disagreement, and a coefficient of 0 would indicate that the rankings were unrelated.
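For instance, with two hypothetical rankings of 10 television shows:

men   <- 1:10                                 # men's preference ranking
women <- c(2, 1, 4, 3, 5, 7, 6, 10, 8, 9)     # women's ranking of the same shows
cor(men, women, method = "spearman")          # a value near 1 indicates strong agreement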

28
Q

Statistical quality control

A

refers to the use of statistical methods in the monitoring and maintaining of the quality of products and services. One method, referred to as acceptance sampling, can be used when a decision must be made to accept or reject a group of parts or items based on the quality found in a sample. A second method, referred to as statistical process control, uses graphical displays known as control charts to determine whether a process should be continued or should be adjusted to achieve the desired quality.

29
Q

Acceptance sampling

A

Assume that a consumer receives a shipment of parts called a lot from a producer. A sample of parts will be taken and the number of defective items counted. If the number of defective items is low, the entire lot will be accepted. If the number of defective items is high, the entire lot will be rejected. Correct decisions correspond to accepting a good-quality lot and rejecting a poor-quality lot. Because sampling is being used, the probabilities of erroneous decisions need to be considered. The error of rejecting a good-quality lot creates a problem for the producer; the probability of this error is called the producer’s risk. On the other hand, the error of accepting a poor-quality lot creates a problem for the purchaser or consumer; the probability of this error is called the consumer’s risk.

The design of an acceptance sampling plan consists of determining a sample size n and an acceptance criterion c, where c is the maximum number of defective items that can be found in the sample and the lot still be accepted. The key to understanding both the producer’s risk and the consumer’s risk is to assume that a lot has some known percentage of defective items and compute the probability of accepting the lot for a given sampling plan. By varying the assumed percentage of defective items in a lot, several different sampling plans can be evaluated and a sampling plan selected such that both the producer’s and consumer’s risks are reasonably low.
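The acceptance probability for a given plan follows from the binomial distribution; the plan (n = 50, c = 2) and the assumed defect rates below are illustrative:

accept_prob <- function(p_defective, n = 50, c_max = 2)
  pbinom(c_max, size = n, prob = p_defective)   # P(at most c defectives in the sample)

accept_prob(0.01)   # good lot: high acceptance probability (small producer's risk)
accept_prob(0.10)   # poor lot: low acceptance probability (small consumer's risk)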

30
Q

statistical process control

A

uses sampling and statistical methods to monitor the quality of an ongoing process such as a production operation. A graphical display referred to as a control chart provides a basis for deciding whether the variation in the output of a process is due to common causes (randomly occurring variations) or to out-of-the-ordinary assignable causes. Whenever assignable causes are identified, a decision can be made to adjust the process in order to bring the output back to acceptable quality levels.

31
Q

Control charts can be classified by the type of data they contain. For instance, an x̄-chart is employed in situations where a sample mean is used to measure the quality of the output. Quantitative data such as length, weight, and temperature can be monitored with an x̄-chart. Process variability can be monitored using a range or R-chart.

A

In cases in which the quality of output is measured in terms of the number of defectives or the proportion of defectives in the sample, an np-chart or a p-chart can be used.

32
Q

All control charts are constructed in a similar fashion. For example, the centre line of an x̄-chart corresponds to the mean of the process when the process is in control and producing output of acceptable quality. The vertical axis of the control chart identifies the scale of measurement for the variable of interest. The upper horizontal line of the control chart, referred to as the upper control limit, and the lower horizontal line, referred to as the lower control limit, are chosen so that when the process is in control there will be a high probability that the value of a sample mean will fall between the two control limits. Standard practice is to set the control limits at three standard deviations above and below the process mean.

A

The process can be sampled periodically. As each sample is selected, the value of the sample mean is plotted on the control chart. If the value of a sample mean is within the control limits, the process can be continued under the assumption that the quality standards are being maintained. If the value of the sample mean is outside the control limits, an out-of-control conclusion points to the need for corrective action in order to return the process to acceptable quality levels.
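A sketch of the control limits for an x̄-chart in R, with an illustrative process mean, standard deviation, and sample size:

mu <- 50; sigma <- 2; n <- 5   # hypothetical in-control process and sample size
se <- sigma / sqrt(n)          # standard deviation of the sample mean
c(LCL = mu - 3 * se, center = mu, UCL = mu + 3 * se)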

33
Q

Sample survey methods

A

Statistical inference is the process of using data from a sample to make estimates or test hypotheses about a population. The field of sample survey methods is concerned with effective ways of obtaining sample data. The three most common types of sample surveys are mail surveys, telephone surveys, and personal interview surveys. All of these involve the use of a questionnaire, for which a large body of knowledge exists concerning the phrasing, sequencing, and grouping of questions. There are other types of sample surveys that do not involve a questionnaire. For example, the sampling of accounting records for audits and the use of a computer to sample a large database are sample surveys that use direct observation of the sampled units to collect the data.

34
Q

A goal in the design of sample surveys is to obtain a sample that is representative of the population so that precise inferences can be made. Sampling error is the difference between a population parameter and a sample statistic used to estimate it. For example, the difference between a population mean and a sample mean is sampling error. Sampling error occurs because a portion, and not the entire population, is surveyed. Probability sampling methods, where the probability of each unit appearing in the sample is known, enable statisticians to make probability statements about the size of the sampling error.

A

Nonprobability sampling methods, which are based on convenience or judgment rather than on probability, are frequently used for cost and time advantages. However, one should be extremely careful in making inferences from a nonprobability sample; whether or not the sample is representative is dependent on the judgment of the individuals designing and conducting the survey and not on sound statistical principles. In addition, there is no objective basis for establishing bounds on the sampling error when a nonprobability sample has been used.

35
Q

Most governmental and professional polling surveys employ probability sampling. It can generally be assumed that any survey that reports a plus or minus margin of error has been conducted using probability sampling. Statisticians prefer probability sampling methods and recommend that they be used whenever possible. A variety of probability sampling methods are available. A few of the more common ones are reviewed here.

A

Simple random sampling provides the basis for many probability sampling methods. With simple random sampling, every possible sample of size n has the same probability of being selected. This method was discussed above in the section Estimation.

36
Q

Stratified simple random sampling is a variation of simple random sampling in which the population is partitioned into relatively homogeneous groups called strata and a simple random sample is selected from each stratum.

A

The results from the strata are then aggregated to make inferences about the population. A side benefit of this method is that inferences about the subpopulation represented by each stratum can also be made.

37
Q

Cluster sampling involves partitioning the population into separate groups called clusters. Unlike in the case of stratified simple random sampling, it is desirable for the clusters to be composed of heterogeneous units. In single-stage cluster sampling, a simple random sample of clusters is selected, and data are collected from every unit in the sampled clusters.

A

In two-stage cluster sampling, a simple random sample of clusters is selected and then a simple random sample is selected from the units in each sampled cluster. One of the primary applications of cluster sampling is called area sampling, where the clusters are counties, townships, city blocks, or other well-defined geographic sections of the population.

38
Q

Decision Analysis

A

also called statistical decision theory, involves procedures for choosing optimal decisions in the face of uncertainty. In the simplest situation, a decision maker must choose the best decision from a finite set of alternatives when there are two or more possible future events, called states of nature, that might occur. The list of possible states of nature includes everything that can happen, and the states of nature are defined so that only one of the states will occur. The outcome resulting from the combination of a decision alternative and a particular state of nature is referred to as the payoff.

39
Q

When probabilities for the states of nature are available, probabilistic criteria may be used to choose the best decision alternative. The most common approach is to use the probabilities to compute the expected value of each decision alternative.

A

The expected value of a decision alternative is the sum of weighted payoffs for the decision. The weight for a payoff is the probability of the associated state of nature and therefore the probability that the payoff occurs. For a maximization problem, the decision alternative with the largest expected value will be chosen; for a minimization problem, the decision alternative with the smallest expected value will be chosen.
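A small worked example in R, with a hypothetical payoff table for two decision alternatives and two states of nature:

payoffs <- rbind(alt1 = c(100, 40),   # payoffs under states 1 and 2
                 alt2 = c(60, 55))
probs <- c(0.3, 0.7)                  # probabilities of the states of nature
ev <- payoffs %*% probs               # expected value of each alternative
ev                                    # alt1: 58, alt2: 56.5
rownames(payoffs)[which.max(ev)]      # best alternative for a maximization problem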

40
Q

Decision analysis can be extremely helpful in sequential decision-making situations—that is, situations in which a decision is made, an event occurs, another decision is made, another event occurs, and so on. For instance, a company trying to decide whether or not to market a new product might first decide to test the acceptance of the product using a consumer panel.

A

Based on the results of the consumer panel, the company will then decide whether or not to proceed with further test marketing; after analyzing the results of the test marketing, company executives will decide whether or not to produce the new product.

41
Q

A decision tree is a graphical device that is helpful in structuring and analyzing such problems. With the aid of decision trees, an optimal decision strategy can be developed.

A

A decision strategy is a contingency plan that recommends the best decision alternative depending on what has happened earlier in the sequential process.

42
Q

What is Experimental Design?

A

In the field of statistics, experimental design means the process of designing a statistical experiment: one that is objective, controlled, and quantitative. An experiment is a procedure to test a hypothesis (an assumption, made before the experiment begins, about what the conclusion will be).

43
Q

The components of an experimental design are: observation, question, hypothesis, methods, and results. Every experiment starts with an observation of an item in nature followed by a question about the item and a hypothesis about why the item is the way it is.

A

In order to test the hypothesis, methods are used to reach end results, either proving or disproving the hypothesis.

44
Q

An effective experimental design must determine or define the following features prior to initiation of the experiment:

A

Research questions—specific concerns that the researcher aims to answer through the experiment

Data sampling—a sample (a subset of a population) taken to estimate the characteristics of the population

Treatment group—(also called the experimental group) receives the treatment the researcher is evaluating

Control group—does not receive the treatment but receives either a standard treatment (with known effect) or a placebo (a fake or inactive treatment)

Experimental unit—the physical and primary unit of interest that receives a treatment; it is the subject of the experiment, which is commonly a person, animal, or thing

Independent variables—the factors (or causes) that the researcher controls and intentionally changes during an experiment in order to see how the dependent variables are impacted; they don’t depend on any other variable in the experiment

Dependent variables—the effects the researcher measures; they change in response to the independent variable(s) in the experiment

45
Q

Understanding the concepts of control groups, treatment groups, and variables is essential in experimental design. The following two statistical experiment examples further illustrate these concepts:

A

100 students are recruited for a study. Half attend a review session for an exam (treatment group) while the other half study as they normally would for the exam (control group). The second group is the control group as it does not receive the treatment of attending a review session.

Four beans are placed under four different types of light (independent variable). After six days, the height of the plant from each bean (dependent variable) is measured. The type of light is controlled by the researcher, so it is the independent variable. The height of the beans is the dependent variable because it depends on the independent variable which is the light.

46
Q

Step 1: Choose an Experimental Design
There are many types of experimental designs in statistics:

A

Case study—an in-depth study of a specific subject, such as a person, group, or event

Between-subjects design—a study where participants are assigned to just a single treatment; this means one group would receive a treatment while the other group would receive another treatment and the differences between the two groups would be compared

Cohort study—a study that follows a group of people over time

Quasi-experimental design—a study that resembles a true experimental design yet there is no random assignment of participants to groups

Repeated-measures design—a study where multiple or repeated measurements are made on each experimental unit

Matched-pairs design—a study where participants with the same characteristics are put in pairs and within the pair, one is assigned to the treatment group while the other goes to the control group

Survey design—a study that involves conducting research through the administration of surveys to participants

Observational study—a study that observes subjects and measures variables in order to investigate questions about a population or an association between two variables. Cross-sectional and longitudinal studies are two types of observational studies. Cross-sectional studies collect data from subjects at one specific point in time, while longitudinal studies repeatedly collect data from the same subjects over a period of time.

Thus, choosing the appropriate type of experimental design, based on the features defined prior to initiation and on what the experiment requires, is the first step in experimental design.

47
Q

Step 2: Establish Research Questions, Hypotheses, Variables, and Controls
After choosing the appropriate type of experimental design, the researcher can proceed with establishing and defining the following:

A

Consider a running example: an experiment on three plots of flowers that studies how watering and weeding affect flower growth.

Research Question—How does the frequency of watering and weeding over a 12-week period affect flower growth rate in height?

Hypothesis—The researcher assumes daily watering and weeding will yield twice the growth rate of not watering and weeding.

Independent Variables—Weeding and watering because they are controlled by the researcher, and therefore, do not depend on any other variable.

Dependent Variable—The flower height because it is affected by the independent variables and is what the researcher is measuring.

Treatment Groups—The first and second plots because they both receive the treatment of watering and weeding.

Control Group—The third plot because it does not receive treatment.

Experimental Units—The plots of flowers because they are the primary units receiving the treatment.

An important point to also consider is to ensure the experiment is controlled for bias and extraneous variables. If the treatment group or control group is not representative of the larger population, selection bias (an extraneous variable) can occur, which undermines the validity of the experiment. Reducing the number of experimental units too far is one way to increase the potential for extraneous variables to impact the outcome of the experiment.

Some effective ways to control for extraneous variables are watching for bias, setting controls, and randomization.

48
Q

Step 3: Apply a Sampling Method and Partition the Groups
In the previous section, the concept of random sampling was introduced. Random sampling, along with other common approaches to sampling, is defined as follows:

A

Simple random sampling—randomly selects subjects from the population (without regard to any particular characteristic) to represent the entire population

Stratified random sampling—first divides the population into strata or smaller subgroups based on shared characteristics and then randomly selects from each subgroup (so members from each subgroup will be in the data analysis)

Convenience sampling—selects subjects out of the population that are easily accessible or convenient to the researcher

After a researcher obtains their sample via one of these sampling methods, the participants within the sample can then be partitioned. This means the participants can be separated into control and treatment groups by different assignment methods, such as:

Randomized assignment—randomly assigns participants to either the control or a treatment group

Block assignment—sorts experimental units into homogeneous groups called blocks, which helps to control sources of variation or nuisance variables

Factorial assignment—assigns participants based on a factorial experiment, i.e., an experiment with two or more independent variables

These concepts tie into the principles of experimental design. In order to organize experiments in a manner that creates reliable, unbiased data, there are four principles of experimental design: controlling, randomization, replication, and blocking.

49
Q

The concept of applying a sampling method can be further illustrated by reviewing the flower-growth experiment example. As previously mentioned, random sampling controls the impact of extraneous variables. However, it is not as effective in experiments with just a few experimental units. Since there are only two experimental units (plots of flowers receiving treatment) in the example case, random sampling will not help.

A

However, if the experiment were expanded to a greater number of plots, like over fifty plots, then random sampling would be effective, as plots could be randomly selected from the population to represent the entire population.

50
Q

Step 4: Collect and Analyze the Data

A

The next step is to collect and analyze the data. The data collected in an experiment can be analyzed through the use of statistical tests. Common statistical test types include correlation, regression analysis, analysis of variance (ANOVA), t-test, and chi-square.

51
Q

For smaller or simple experiments, data collected can also be analyzed through comparison, such as comparing two measurements on a histogram or a chart. In the statistical experiment example case, data obtained can be simply analyzed through comparison. For instance, the data may show after twelve weeks that the first, second, and third plots yield flowers that are 8, 6, and 2 inches high, respectively. Since this experiment does not have a lot of data, simple comparison can be used to analyze the results.

A

After analysis of the data, conclusions can then be made, and the hypothesis can be evaluated to see if it is valid or not.

52
Q

Bioinformatics is defined as the application of tools of computation and analysis to the capture and interpretation of biological data. It is an interdisciplinary field, which harnesses computer science, mathematics, physics, and biology.

A

Bioinformatics is essential for management of data in modern biology and medicine. This paper describes the main tools of the bioinformatician and discusses how they are being used to interpret biological data and to further understanding of disease. The potential clinical applications of these data in drug discovery and development are also discussed.

53
Q

The main tools of a bioinformatician are computer software programs and the internet. A fundamental activity is sequence analysis of DNA and proteins using various programs and databases available on the world wide web. Anyone, from clinicians to molecular biologists, with access to the internet and relevant websites can now freely discover the composition of biological molecules such as nucleic acids and proteins by using basic bioinformatic tools.

A

This does not imply that handling and analysis of raw genomic data can easily be carried out by all. Bioinformatics is an evolving discipline, and expert bioinformaticians now use complex software programs for retrieving, sorting, analysing, predicting, and storing DNA and protein sequence data.

54
Q

Large commercial enterprises such as pharmaceutical companies employ bioinformaticians to meet the large-scale and complicated bioinformatic needs of these industries. With an ever-increasing need for constant input from bioinformatic experts, most biomedical laboratories may soon have their own in-house bioinformatician.

A

The individual researcher, beyond a basic acquisition and analysis of simple data, would certainly need external bioinformatic advice for any complex analysis.

55
Q

The growth of bioinformatics has been a global venture, creating computer networks that have allowed easy access to biological data and enabled the development of software programs for effortless analysis.

A

Multiple international projects aimed at providing gene and protein databases are available freely to the whole scientific community via the internet.

56
Q

The escalating amount of data from the genome projects has necessitated computer databases that feature rapid assimilation, usable formats and algorithm software programs for efficient management of biological data.13 Because of the diverse nature of emerging data, no single comprehensive database exists for accessing all this information. However, a growing number of databases that contain helpful information for clinicians and researchers are available. Information provided by most of these databases is free of charge to academics, although some sites require subscription and industrial users pay a licence fee for particular sites.

A

Examples range from sites providing comprehensive descriptions of clinical disorders, listing disease susceptibility genetic mutations and polymorphisms, to those enabling a search for disease genes given a DNA sequence (box).

57
Q

These databases include both “public” repositories of gene data as well as those developed by private companies. The easiest way to identify databases is by searching for bioinformatic tools and databases in any one of the commonly used search engines. Another way to identify bioinformatic sources is through database links and searchable indexes provided by one of the major public databases. For example, the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov) provides the Entrez browser, which is an integrated database retrieval system that allows integration of DNA and protein sequence databases.

A

The European Bioinformatics Institute archives gene and protein data from genome studies of all organisms, whereas Ensembl produces and maintains automatic annotation on eukaryotic genomes (fig 2). The quality and reliability of databases vary; certainly some of the better known and more established ones, such as those above, are superior to others.

58
Q

Since the completion of the first draft of the human genome,1,2 the emphasis has been changing from genes themselves to gene products.

A

Functional genomics assigns functional relevance to genomic information. It is the study of genes, their resulting proteins, and the role played by the proteins.

59
Q

Analysis and interpretation of biological data considers information not only at the level of the genome but at the level of the proteome and the transcriptome (fig 4). Proteomics is the analysis of the total amount of proteins (proteome) expressed by a cell, and transcriptomics refers to the analysis of the messenger RNA transcripts produced by a cell (transcriptome). DNA microarray technology determines the expression level of genes and includes genotyping and DNA sequencing.

A

Gene expression arrays allow simultaneous analysis of the messenger RNA expression levels of thousands of genes in benign and malignant tumours, such as keloid and melanoma. Expression profiles classify tumours and provide potential therapeutic targets.

60
Q

Bioinformatic protein research draws on annotated protein and two dimensional electrophoresis databases. After separation, identification, and characterisation of a protein, the next challenge in bioinformatics is the prediction of its structure.

A

Structural biologists also use bioinformatics to handle the vast and complex data from x ray crystallography, nuclear magnetic resonance, and electron microscopy investigations to create three dimensional models of molecules.

61
Q

Apart from analysis of genome sequence data, bioinformatics is now being used for a vast array of other important tasks, including analysis of gene variation and expression, analysis and prediction of gene and protein structure and function, prediction and detection of gene regulation networks, simulation environments for whole cell modelling, complex modelling of gene regulatory dynamics and networks, and presentation and analysis of molecular pathways in order to understand gene-disease interactions.

A

Although on a smaller scale, simpler bioinformatic tasks valuable to the clinical researcher can vary from designing primers (short oligonucleotide sequences needed for DNA amplification in polymerase chain reaction experiments) to predicting the function of gene products.

62
Q

A t test is a statistical test that is used to compare the means of two groups.

A

It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.

63
Q

t test example
You want to know whether the mean petal length of iris flowers differs according to their species. You find two different species of irises growing in a garden and measure 25 petals of each species. You can test the difference between these two groups using a t test and null and alternative hypotheses.

A

The null hypothesis (H0) is that the true difference between these group means is zero.

The alternate hypothesis (Ha) is that the true difference is different from zero.

64
Q

A t test can only be used when comparing the means of two groups (a.k.a. pairwise comparison). If you want to compare more than two groups, or if you want to do multiple pairwise comparisons, use an ANOVA test or a post-hoc test.

The t test is a parametric test of difference, meaning that it makes the same assumptions about your data as other parametric tests. The t test assumes your data:

A
  1. are independent
  2. are (approximately) normally distributed
  3. have a similar amount of variance within each group being compared (a.k.a. homogeneity of variance)

If your data do not fit these assumptions, you can try a nonparametric alternative to the t test, such as the Wilcoxon signed-rank test; unequal variances alone can instead be handled with Welch's t test.

65
Q

What type of t test should I use?

A

When choosing a t test, you will need to consider two things: whether the groups being compared come from a single population or two different populations, and whether you want to test the difference in a specific direction.

66
Q

One-sample, two-sample, or paired t test?

A

If the groups come from a single population (e.g., measuring before and after an experimental treatment), perform a paired t test. This is a within-subjects design.

If the groups come from two different populations (e.g., two different species, or people from two separate cities), perform a two-sample t test (a.k.a. independent t test). This is a between-subjects design.

If there is one group being compared against a standard value (e.g., comparing the acidity of a liquid to a neutral pH of 7), perform a one-sample t test.

67
Q

One-tailed or two-tailed t test?

A

If you only care whether the two populations are different from one another, perform a two-tailed t test.

If you want to know whether one population mean is greater than or less than the other, perform a one-tailed t test.

68
Q

In your test of whether petal length differs by species:

A

Your observations come from two separate populations (separate species), so you perform a two-sample t test.

You don’t care about the direction of the difference, only whether there is a difference, so you choose to use a two-tailed t test.

69
Q

performing a t test

A

The t test estimates the true difference between two group means using the ratio of the difference in group means over the pooled standard error of both groups. You can calculate it manually using a formula, or use statistical analysis software.

70
Q

T test formula

A

The formula for the two-sample t test (a.k.a. the Student's t test) is shown below.

t = (x̄1 − x̄2) / √( s² (1/n1 + 1/n2) )

In this formula, t is the t value, x̄1 and x̄2 are the means of the two groups being compared, s² is the pooled variance of the two groups (making the denominator the pooled standard error of the difference), and n1 and n2 are the number of observations in each of the groups.

A larger t value shows that the difference between group means is greater than the pooled standard error, indicating a more significant difference between the groups.

You can compare your calculated t value against the values in a critical value chart (e.g., Student’s t table) to determine whether your t value is greater than what would be expected by chance. If so, you can reject the null hypothesis and conclude that the two groups are in fact different.

71
Q

T test function in statistical software

A

Most statistical software (R, SPSS, etc.) includes a t test function. This built-in function will take your raw data and calculate the t value. It will then compare it to the critical value, and calculate a p-value. This way you can quickly see whether your groups are statistically different.

In your comparison of flower petal lengths, you decide to perform your t test using R. The code looks like this:

t.test(Petal.Length ~ Species, data = flower.data)

72
Q

presenting the results of a t test

A

When reporting your t test results, the most important values to include are the t value, the p value, and the degrees of freedom for the test. These will communicate to your audience whether the difference between the two groups is statistically significant (a.k.a. that it is unlikely to have happened by chance).

You can also include the summary statistics for the groups being compared, namely the mean and standard deviation. In R, the code for calculating the mean and the standard deviation from the data looks like this:

library(dplyr)  # provides %>%, group_by(), and summarize()

flower.data %>%
  group_by(Species) %>%
  summarize(mean_length = mean(Petal.Length),
            sd_length = sd(Petal.Length))

73
Q

Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable’s value is called the independent variable.

A

This form of analysis estimates the coefficients of the linear equation, involving one or more independent variables that best predict the value of the dependent variable. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. There are simple linear regression calculators that use a “least squares” method to discover the best-fit line for a set of paired data. You can then estimate the value of the dependent variable (Y) from the independent variable (X).

74
Q

Linear-regression models are relatively simple and provide an easy-to-interpret mathematical formula that can generate predictions. Linear regression can be applied to various areas in business and academic study.

A

You’ll find that linear regression is used in everything from biological, behavioral, environmental and social sciences to business. Linear-regression models have become a proven way to scientifically and reliably predict the future. Because linear regression is a long-established statistical procedure, the properties of linear-regression models are well understood and can be trained very quickly.

75
Q

Key assumptions of effective linear regression
Assumptions to be considered for success with linear-regression analysis:

A

For each variable: Consider the number of valid cases, mean and standard deviation.
For each model: Consider regression coefficients, correlation matrix, part and partial correlations, multiple R, R2, adjusted R2, change in R2, standard error of the estimate, analysis-of-variance table, predicted values and residuals. Also, consider 95-percent-confidence intervals for each regression coefficient, variance-covariance matrix, variance inflation factor, tolerance, Durbin-Watson test, distance measures (Mahalanobis, Cook and leverage values), DfBeta, DfFit, prediction intervals and case-wise diagnostic information.
Plots: Consider scatterplots, partial plots, histograms and normal probability plots.
Data: Dependent and independent variables should be quantitative. Categorical variables, such as religion, major field of study or region of residence, need to be recoded to binary (dummy) variables or other types of contrast variables.
Other assumptions: For each value of the independent variable, the distribution of the dependent variable must be normal. The variance of the distribution of the dependent variable should be constant for all values of the independent variable. The relationship between the dependent variable and each independent variable should be linear and all observations should be independent.

76
Q

Make sure your data meets linear-regression assumptions
Before you attempt to perform linear regression, you need to make sure that your data can be analyzed using this procedure. Your data must pass through certain required assumptions.

A

Here’s how you can check for these assumptions:

The variables should be measured at a continuous level. Examples of continuous variables are time, sales, weight and test scores.
Use a scatterplot to find out quickly if there is a linear relationship between those two variables.
The observations should be independent of each other (that is, there should be no dependency).
Your data should have no significant outliers.
Check for homoscedasticity — a statistical concept in which the variances along the best-fit linear-regression line remain similar all through that line.
The residuals (errors) of the best-fit regression line should follow a normal distribution.

77
Q

Simple linear regression

A

1 dependent variable (interval or ratio), 1 independent variable (interval or ratio or dichotomous)

78
Q

Multiple linear regression

A

1 dependent variable (interval or ratio), 2+ independent variables (interval or ratio or dichotomous)

79
Q

Logistic regression

A

1 dependent variable (dichotomous), 2+ independent variable(s) (interval or ratio or dichotomous)

80
Q

Ordinal regression

A

1 dependent variable (ordinal), 1+ independent variable(s) (nominal or dichotomous)

81
Q

Multinomial regression

A

1 dependent variable (nominal), 1+ independent variable(s) (interval or ratio or dichotomous)

82
Q

Discriminant analysis

A

1 dependent variable (nominal), 1+ independent variable(s) (interval or ratio)