Sect 3.1- Journals, design, analysis, bioinformatics, T test, & linear regression Flashcards

Question

Nonparametric methods

Answer 1

The statistical methods discussed above generally focus on the parameters of populations or probability distributions and are referred to as parametric methods. Nonparametric methods are statistical methods that require fewer assumptions about a population or probability distribution and are applicable in a wider range of situations. For a statistical method to be classified as a nonparametric method, it must satisfy one of the following conditions: (1) the method is used with qualitative data, or (2) the method is used with quantitative data when no assumption can be made about the population probability distribution. In cases where both parametric and nonparametric methods are applicable, statisticians usually recommend using parametric methods because they tend to provide better precision. Nonparametric methods are useful, however, in situations where the assumptions required by parametric methods appear questionable. A few of the more commonly used nonparametric methods are described below. Assume that individuals in a sample are asked to state a preference for one of two similar and competing products. A plus (+) sign can be recorded if an individual prefers one product and a minus (−) sign if the individual prefers the other product. With qualitative data in this form, the nonparametric sign test can be used to statistically determine whether a difference in preference for the two products exists for the population. The sign test also can be used to test hypotheses about the value of a population median.
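As a rough illustration of the sign test described above, the sketch below computes an exact two-sided p-value from the counts of plus and minus signs. The function name and the example counts are hypothetical, and it assumes ties have already been dropped and that the null hypothesis is "no preference" (each sign has probability 0.5).

```python
from math import comb

def sign_test_p_value(n_plus: int, n_minus: int) -> float:
    """Exact two-sided sign test p-value under H0: P(+) = P(-) = 0.5."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    # Binomial(n, 0.5) probability of a count at least as extreme as k in the lower tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test, capping at 1
    return min(1.0, 2 * tail)

# Hypothetical survey: 8 respondents prefer one product (+), 2 prefer the other (-)
p_value = sign_test_p_value(8, 2)
```

A small p-value would suggest a real difference in preference; with only ten respondents, even an 8 to 2 split gives fairly weak evidence.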

Answer 2

Another nonparametric test for detecting differences between two populations is the Mann-Whitney-Wilcoxon test. This method is based on data from two independent random samples, one from population 1 and another from population 2. There is no matching or pairing as required for the Wilcoxon signed-rank test.
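One common way to express the Mann-Whitney-Wilcoxon test statistic is as a count of pairwise "wins" between the two independent samples. The sketch below assumes that formulation (the U statistic, with ties counted as half a win); the function name and data are made up for illustration, and no p-value is computed.

```python
def mann_whitney_u(sample1, sample2):
    """Return (U1, U2), where U1 counts pairs in which a sample1 value exceeds a sample2 value."""
    u1 = 0.0
    for a in sample1:
        for b in sample2:
            if a > b:
                u1 += 1.0
            elif a == b:
                u1 += 0.5  # a tie counts as half a win for each sample
    u2 = len(sample1) * len(sample2) - u1
    return u1, u2

# Hypothetical measurements from two independent random samples
u1, u2 = mann_whitney_u([1.1, 2.3, 3.0], [2.9, 3.5, 4.2, 5.0])
```

A U value far from half the number of pairs (here 12 / 2 = 6) suggests that one population tends to produce larger values than the other.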

Answer 3

A Spearman rank correlation coefficient of 1 would indicate complete agreement, a coefficient of −1 would indicate complete disagreement, and a coefficient of 0 would indicate that the rankings were unrelated.
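The three reference values above can be checked with a small sketch. It assumes each input is already a list of ranks with no ties, in which case the coefficient reduces to the shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)); the function name and rankings are illustrative only.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(ranks_a)
    # Sum of squared differences between paired ranks
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical rankings of five items by two judges
agree = spearman_rho([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])      # identical rankings
disagree = spearman_rho([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])   # reversed rankings
```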

Answer 4

refers to the use of statistical methods in the monitoring and maintaining of the quality of products and services. One method, referred to as acceptance sampling, can be used when a decision must be made to accept or reject a group of parts or items based on the quality found in a sample. A second method, referred to as statistical process control, uses graphical displays known as control charts to determine whether a process should be continued or should be adjusted to achieve the desired quality.

Answer 5

Assume that a consumer receives a shipment of parts called a lot from a producer. A sample of parts will be taken and the number of defective items counted. If the number of defective items is low, the entire lot will be accepted. If the number of defective items is high, the entire lot will be rejected. Correct decisions correspond to accepting a good-quality lot and rejecting a poor-quality lot. Because sampling is being used, the probabilities of erroneous decisions need to be considered. The error of rejecting a good-quality lot creates a problem for the producer; the probability of this error is called the producer’s risk. On the other hand, the error of accepting a poor-quality lot creates a problem for the purchaser or consumer; the probability of this error is called the consumer’s risk. The design of an acceptance sampling plan consists of determining a sample size n and an acceptance criterion c, where c is the maximum number of defective items that can be found in the sample and the lot still be accepted. The key to understanding both the producer’s risk and the consumer’s risk is to assume that a lot has some known percentage of defective items and compute the probability of accepting the lot for a given sampling plan. By varying the assumed percentage of defective items in a lot, several different sampling plans can be evaluated and a sampling plan selected such that both the producer’s and consumer’s risks are reasonably low.
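The probability of accepting a lot under a plan (n, c) can be sketched as follows. This assumes the number of defectives in the sample is approximately binomial (a large lot relative to the sample); the lot qualities of 5% and 25% defective and the plan n = 15, c = 0 are hypothetical values chosen only to show the calculation.

```python
from math import comb

def prob_accept(n: int, c: int, p_defective: float) -> float:
    """P(at most c defectives in a sample of n), using a binomial model."""
    return sum(
        comb(n, k) * p_defective ** k * (1 - p_defective) ** (n - k)
        for k in range(c + 1)
    )

# Hypothetical plan: sample 15 items, accept only if none are defective
n, c = 15, 0
producer_risk = 1 - prob_accept(n, c, 0.05)  # rejecting a good lot (5% defective)
consumer_risk = prob_accept(n, c, 0.25)      # accepting a poor lot (25% defective)
```

Repeating the calculation for other values of n and c shows how the two risks trade off against each other.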

Answer 6

uses sampling and statistical methods to monitor the quality of an ongoing process such as a production operation. A graphical display referred to as a control chart provides a basis for deciding whether the variation in the output of a process is due to common causes (randomly occurring variations) or to out-of-the-ordinary assignable causes. Whenever assignable causes are identified, a decision can be made to adjust the process in order to bring the output back to acceptable quality levels.

Answer 7

In cases in which the quality of output is measured in terms of the number of defectives or the proportion of defectives in the sample, an np-chart or a p-chart can be used.
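For a p-chart, the control limits are often set three standard errors either side of the long-run proportion defective. The sketch below assumes that three-sigma convention, which the text above does not state, and uses hypothetical values for the proportion and sample size.

```python
from math import sqrt

def p_chart_limits(p_bar: float, n: int):
    """Return (lower, upper) three-sigma control limits for a p-chart."""
    # Standard error of a sample proportion with sample size n
    sigma_p = sqrt(p_bar * (1 - p_bar) / n)
    lower = max(0.0, p_bar - 3 * sigma_p)  # a proportion cannot fall below 0
    upper = min(1.0, p_bar + 3 * sigma_p)  # or rise above 1
    return lower, upper

# Hypothetical process: 10% defective on average, samples of 100 items
lcl, ucl = p_chart_limits(0.10, 100)
```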

Answer 8

The process can be sampled periodically. As each sample is selected, the value of the sample mean is plotted on the control chart. If the value of a sample mean is within the control limits, the process can be continued under the assumption that the quality standards are being maintained. If the value of the sample mean is outside the control limits, an out-of-control conclusion points to the need for corrective action in order to return the process to acceptable quality levels.
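The plotting rule above can be sketched as a simple check of each sample mean against the control limits. This assumes the process mean and standard deviation are known, that every sample has the same size, and that three-sigma limits are used; all names and numbers are illustrative.

```python
from math import sqrt
from statistics import mean

def out_of_control(samples, process_mean, process_sd):
    """Return the indexes of samples whose mean falls outside the control limits."""
    n = len(samples[0])  # assumes every sample has the same size
    margin = 3 * process_sd / sqrt(n)
    lcl, ucl = process_mean - margin, process_mean + margin
    return [i for i, s in enumerate(samples) if not lcl <= mean(s) <= ucl]

# Hypothetical periodic samples of four measurements each
samples = [[10.1, 9.8, 10.0, 10.2], [12.0, 11.9, 12.2, 12.1], [9.7, 10.3, 10.0, 9.9]]
flagged = out_of_control(samples, process_mean=10.0, process_sd=1.0)
```

Any flagged index points to a sample that would prompt corrective action.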

Answer 9

Statistical inference is the process of using data from a sample to make estimates or test hypotheses about a population. The field of sample survey methods is concerned with effective ways of obtaining sample data. The three most common types of sample surveys are mail surveys, telephone surveys, and personal interview surveys. All of these involve the use of a questionnaire, for which a large body of knowledge exists concerning the phrasing, sequencing, and grouping of questions. There are other types of sample surveys that do not involve a questionnaire. For example, the sampling of accounting records for audits and the use of a computer to sample a large database are sample surveys that use direct observation of the sampled units to collect the data.

Answer 10

Nonprobability sampling methods, which are based on convenience or judgment rather than on probability, are frequently used for cost and time advantages. However, one should be extremely careful in making inferences from a nonprobability sample; whether or not the sample is representative is dependent on the judgment of the individuals designing and conducting the survey and not on sound statistical principles. In addition, there is no objective basis for establishing bounds on the sampling error when a nonprobability sample has been used.

Answer 11

Simple random sampling provides the basis for many probability sampling methods. With simple random sampling, every possible sample of size n has the same probability of being selected. This method was discussed above in the section Estimation.
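A minimal sketch of simple random sampling, assuming the population can be listed in full. Python's `random.sample` draws without replacement so that every subset of size n is equally likely; the population of 100 numbered units and the seed are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every possible sample of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(list(population), n)

# Hypothetical sampling frame of 100 numbered units
sample = simple_random_sample(range(100), 10, seed=1)
```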

Answer 12

The results from the strata are then aggregated to make inferences about the population. A side benefit of this method is that inferences about the subpopulation represented by each stratum can also be made.

Answer 13

In two-stage cluster sampling, a simple random sample of clusters is selected and then a simple random sample is selected from the units in each sampled cluster. One of the primary applications of cluster sampling is called area sampling, where the clusters are counties, townships, city blocks, or other well-defined geographic sections of the population.
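The two stages described above can be sketched as two nested random draws. The cluster names, unit labels, and sample sizes below are hypothetical, and the function simply takes every unit when a cluster has fewer units than requested.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_units, seed=None):
    """Stage 1: sample clusters at random. Stage 2: sample units within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_units, len(clusters[name])))
        for name in chosen
    }

# Hypothetical area sample: city blocks as clusters, households as units
blocks = {"block A": [1, 2, 3, 4], "block B": [5, 6, 7], "block C": [8, 9, 10, 11, 12]}
selected = two_stage_cluster_sample(blocks, n_clusters=2, n_units=2, seed=0)
```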

Answer 14

also called statistical decision theory, involves procedures for choosing optimal decisions in the face of uncertainty. In the simplest situation, a decision maker must choose the best decision from a finite set of alternatives when there are two or more possible future events, called states of nature, that might occur. The list of possible states of nature includes everything that can happen, and the states of nature are defined so that only one of the states will occur. The outcome resulting from the combination of a decision alternative and a particular state of nature is referred to as the payoff.

Answer 15

The expected value of a decision alternative is the sum of weighted payoffs for the decision. The weight for a payoff is the probability of the associated state of nature and therefore the probability that the payoff occurs. For a maximization problem, the decision alternative with the largest expected value will be chosen; for a minimization problem, the decision alternative with the smallest expected value will be chosen.
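The expected value rule can be sketched with a small payoff table. The decision alternatives, payoffs, and state-of-nature probabilities below are invented for illustration; the calculation itself is just the probability-weighted sum described above.

```python
def expected_values(payoffs, probabilities):
    """Map each decision alternative to its probability-weighted payoff."""
    return {
        decision: sum(p * payoff for p, payoff in zip(probabilities, row))
        for decision, row in payoffs.items()
    }

# Hypothetical payoffs for two states of nature (low demand, high demand)
payoffs = {"small plant": [8, 7], "large plant": [5, 14]}
probabilities = [0.6, 0.4]

ev = expected_values(payoffs, probabilities)
best = max(ev, key=ev.get)  # maximization problem: pick the largest expected value
```

For a minimization problem (for example, costs instead of profits), `min` would replace `max`.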

Answer 16

Based on the results of the consumer panel, the company will then decide whether or not to proceed with further test marketing; after analyzing the results of the test marketing, company executives will decide whether or not to produce the new product.

Answer 17

A decision strategy is a contingency plan that recommends the best decision alternative depending on what has happened earlier in the sequential process.

Answer 18

A designed experiment in statistics is essential. In the field of statistics, experimental design means the process of designing a statistical experiment, which is an experiment that is objective, controlled, and quantitative. An experiment is a procedure to test a hypothesis (an assumption of what the conclusion will be before beginning an experiment).

Answer 19

In order to test the hypothesis, methods are used to reach end results that either support or refute the hypothesis.

Answer 20

Research questions: specific concerns that the researcher aims to answer through the experiment
Data sampling: a sample (a subset of a population) taken to estimate the characteristics of the population
Treatment group (also called the experimental group): receives the treatment the researcher is evaluating
Control group: does not receive the treatment but receives either a standard treatment (with known effect) or a placebo (a fake or inactive treatment)
Experimental unit: the physical and primary unit of interest that receives a treatment; it is the subject of the experiment, which is commonly a person, animal, or thing
Independent variables: the factors (or causes) that the researcher controls and intentionally changes during an experiment in order to see how the dependent variables are impacted; they don't depend on any other variable in the experiment
Dependent variables: the effects the researcher measures; they change in response to the independent variable(s) in the experiment

Answer 21

100 students are recruited for a study. Half attend a review session for an exam (treatment group) while the other half study as they normally would for the exam (control group). The second group is the control group because it does not receive the treatment of attending a review session.

Four beans are placed under four different types of light (independent variable). After six days, the height of the plant from each bean (dependent variable) is measured. The type of light is controlled by the researcher, so it is the independent variable. The height of the plants is the dependent variable because it depends on the independent variable, which is the type of light.

Answer 22

Case study: an in-depth study of a specific subject, such as a person, group, or event
Between-subjects design: a study where participants are assigned to just a single treatment; one group receives one treatment while the other group receives another, and the differences between the two groups are compared
Cohort study: a study that follows a group of people over time
Quasi-experimental design: a study that resembles a true experimental design but has no random assignment of participants to groups
Repeated-measures design: a study where multiple or repeated measurements are made on each experimental unit
Matched-pairs design: a study where participants with the same characteristics are put in pairs and, within each pair, one is assigned to the treatment group while the other goes to the control group
Survey design: a study that involves conducting research through the administration of surveys to participants
Observational study: a study that observes subjects and measures variables in order to investigate questions about a population or an association between two variables. Cross-sectional and longitudinal studies are two types of observational studies. Cross-sectional studies collect data from subjects at one specific point in time, while longitudinal studies repeatedly collect data for the same subjects over a period of time.

Thus, one can choose the appropriate type of design before the experiment begins, based on each design's defining features and on what the experiment needs. Choosing a design is the first step in experimental design.

Answer 23

Research Question: How does the frequency of watering and weeding over a 12-week period affect flower growth rate in height?
Hypothesis: The researcher assumes daily watering and weeding will yield twice the growth rate of not watering and weeding.
Independent Variables: Weeding and watering, because they are controlled by the researcher and therefore do not depend on any other variable.
Dependent Variable: The flower height, because it is affected by the independent variables and is what the researcher is measuring.
Treatment Groups: The first and second plots, because they both receive the treatment of watering and weeding.
Control Group: The third plot, because it does not receive treatment.
Experimental Units: The plots of flowers, because they are the primary units receiving the treatment.

It is also important to ensure the experiment is controlled for bias and extraneous variables. If the treatment group or control group is not representative of the larger population, selection bias can occur, which undermines the validity of the experiment. Using too few experimental units also increases the chance that extraneous variables will affect the outcome. Some effective ways to control for extraneous variables are watching for bias, setting controls, and randomization.

Answer 24

Simple random sampling: randomly selects subjects from the population (without any other consideration) to represent the entire population
Stratified random sampling: first divides the population into strata or smaller subgroups based on shared characteristics and then randomly selects from each subgroup (so members from each subgroup will be in the data analysis)
Convenience sampling: selects subjects out of the population that are easily accessible or convenient to the researcher

After a researcher obtains a sample via one of these sampling methods, the participants within the sample can then be partitioned. This means the participants can be separated into control and treatment groups by different assignment methods, such as:

Randomized assignment: randomly assigns participants to either the control or a treatment group
Block assignment: sorts experimental units into homogeneous groups called blocks, which helps to control sources of variation or nuisance variables
Factorial assignment: assigns participants based on a factorial experiment, which is an experiment with two or more independent variables

These concepts lead to the principles of experimental design. To organize experiments in a manner that creates reliable, unbiased data, there are four principles of experimental design: control, randomization, replication, and blocking.
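Randomized assignment, as described above, can be sketched by shuffling the participants and splitting the shuffled list in half. The participant IDs and the even split are assumptions made for illustration; real studies may use unequal group sizes or several treatment groups.

```python
import random

def randomized_assignment(participants, seed=None):
    """Shuffle participants and split them into (treatment, control) groups."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical study with 100 participants identified by number
treatment, control = randomized_assignment(range(100), seed=42)
```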

Answer 25

However, if the experiment were expanded to a greater number of plots, like over fifty plots, then random sampling would be effective, as plots could be randomly selected from the population to represent the entire population.

Answer 26

The next step is to collect and analyze the data. The data collected in an experiment can be analyzed through the use of statistical tests. Common statistical test types include correlation, regression analysis, analysis of variance (ANOVA), t-test, and chi-square.

Answer 27

After analysis of the data, conclusions can then be made, and the hypothesis can be evaluated to see if it is valid or not.

Answer 28

Bioinformatics is essential for management of data in modern biology and medicine. This paper describes the main tools of the bioinformatician and discusses how they are being used to interpret biological data and to further understanding of disease. The potential clinical applications of these data in drug discovery and development are also discussed.

Answer 29

This does not imply that handling and analysis of raw genomic data can easily be carried out by all. Bioinformatics is an evolving discipline, and expert bioinformaticians now use complex software programs for retrieving, sorting out, analysing, predicting, and storing DNA and protein sequence data.

Answer 30

The individual researcher, beyond a basic acquisition and analysis of simple data, would certainly need external bioinformatic advice for any complex analysis.

Answer 31

Multiple international projects aimed at providing gene and protein databases are available freely to the whole scientific community via the internet.

Answer 32

Examples range from sites providing comprehensive descriptions of clinical disorders, listing disease susceptibility genetic mutations and polymorphisms, to those enabling a search for disease genes given a DNA sequence (box).

Answer 33

The European Bioinformatic Institute archives gene and protein data from genome studies of all organisms, whereas Ensembl produces and maintains automatic annotation on eukaryotic genomes (fig 2). The quality and reliability of databases vary; certainly some of the better known and more established ones, such as those above, are superior to others.

Answer 34

Functional genomics assigns functional relevance to genomic information. It is the study of genes, their resulting proteins, and the role played by the proteins.

Answer 35

Gene expression arrays allow simultaneous analysis of the messenger RNA expression levels of thousands of genes in benign and malignant tumours, such as keloid and melanoma. Expression profiles classify tumours and provide potential therapeutic targets.

Answer 36

Structural biologists also use bioinformatics to handle the vast and complex data from x ray crystallography, nuclear magnetic resonance, and electron microscopy investigations to create three dimensional models of molecules.

Answer 37

Although on a smaller scale, simpler bioinformatic tasks valuable to the clinical researcher can vary from designing primers (short oligonucleotide sequences needed for DNA amplification in polymerase chain reaction experiments) to predicting the function of gene products.

Answer 38

It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.

Answer 39

The null hypothesis (H0) is that the true difference between these group means is zero. The alternate hypothesis (Ha) is that the true difference is different from zero.

Answer 40

1. are independent
2. are (approximately) normally distributed
3. have a similar amount of variance within each group being compared (a.k.a. homogeneity of variance)

If your data do not fit these assumptions, you can try a nonparametric alternative to the t test, such as the Wilcoxon signed-rank test (for paired data) or the Mann-Whitney-Wilcoxon test (for two independent samples). If only the equal-variance assumption fails, Welch's t test is the usual alternative.

Answer 41

When choosing a t test, you will need to consider two things: whether the groups being compared come from a single population or two different populations, and whether you want to test the difference in a specific direction.

Answer 42

If the groups come from a single population (e.g., measuring before and after an experimental treatment), perform a paired t test. This is a within-subjects design. If the groups come from two different populations (e.g., two different species, or people from two separate cities), perform a two-sample t test (a.k.a. independent t test). This is a between-subjects design. If there is one group being compared against a standard value (e.g., comparing the acidity of a liquid to a neutral pH of 7), perform a one-sample t test.

Answer 43

If you only care whether the two populations are different from one another, perform a two-tailed t test. If you want to know whether one population mean is greater than or less than the other, perform a one-tailed t test.

Answer 44

Your observations come from two separate populations (separate species), so you perform a two-sample t test. You don’t care about the direction of the difference, only whether there is a difference, so you choose to use a two-tailed t test.

Answer 45

The t test estimates the true difference between two group means using the ratio of the difference in group means over the pooled standard error of both groups. You can calculate it manually using a formula, or use statistical analysis software.

Answer 46

The formula for the two-sample t test (a.k.a. the Student’s t-test) is shown below.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} $$

In this formula, $t$ is the t value, $\bar{x}_1$ and $\bar{x}_2$ are the means of the two groups being compared, $s^2$ is the pooled variance of the two groups (so the full denominator is the pooled standard error of the difference in means), and $n_1$ and $n_2$ are the number of observations in each of the groups. A larger absolute t value shows that the difference between group means is large relative to the pooled standard error, indicating a more significant difference between the groups. You can compare your calculated t value against the values in a critical value chart (e.g., Student’s t table) to determine whether your t value is greater than what would be expected by chance. If so, you can reject the null hypothesis and conclude that the two groups are in fact different.
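To make the formula concrete, here is a minimal sketch that computes the t value by hand. It treats s^2 as the pooled sample variance (the usual reading of this formula) and also returns the degrees of freedom, n1 + n2 - 2; the function name and the two small data sets are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group1, group2):
    """Return (t, degrees of freedom) for a two-sample t test with pooled variance."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / standard_error, df

# Hypothetical measurements for two groups
t_value, df = pooled_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
```

The resulting t value would then be compared with the critical value for the given degrees of freedom, as described above.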

Answer 47

Most statistical software (R, SPSS, etc.) includes a t test function. This built-in function will take your raw data and calculate the t value. It will then compare it to the critical value, and calculate a p-value. This way you can quickly see whether your groups are statistically different. In your comparison of flower petal lengths, you decide to perform your t test using R. The code looks like this:

    t.test(Petal.Length ~ Species, data = flower.data)

Answer 48

When reporting your t test results, the most important values to include are the t value, the p value, and the degrees of freedom for the test. These will communicate to your audience whether the difference between the two groups is statistically significant (a.k.a. that it is unlikely to have happened by chance). You can also include the summary statistics for the groups being compared, namely the mean and standard deviation. In R, the code for calculating the mean and the standard deviation from the data looks like this:

    # dplyr provides %>%, group_by() and summarize()
    library(dplyr)

    flower.data %>%
      group_by(Species) %>%
      summarize(mean_length = mean(Petal.Length),
                sd_length = sd(Petal.Length))

Answer 49

This form of analysis estimates the coefficients of the linear equation, involving one or more independent variables that best predict the value of the dependent variable. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. There are simple linear regression calculators that use a “least squares” method to discover the best-fit line for a set of paired data. You then estimate the value of Y (dependent variable) from X (independent variable).
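For a single independent variable, the least squares line has a closed-form solution, sketched below. The names `predictor` (independent variable) and `response` (dependent variable) and the data points are chosen for illustration; the slope is the sum of cross-deviations divided by the sum of squared deviations of the predictor.

```python
from statistics import mean

def least_squares_line(predictor, response):
    """Return (slope, intercept) of the best-fit line response = intercept + slope * predictor."""
    x_bar, y_bar = mean(predictor), mean(response)
    # Sum of cross-deviations and sum of squared deviations of the predictor
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(predictor, response))
    s_xx = sum((x - x_bar) ** 2 for x in predictor)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

# Hypothetical paired data lying exactly on a line
slope, intercept = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = intercept + slope * 5  # estimate the dependent variable for a new value
```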

Answer 50

You’ll find that linear regression is used in everything from biological, behavioral, environmental and social sciences to business. Linear-regression models have become a proven way to scientifically and reliably predict the future. Because linear regression is a long-established statistical procedure, the properties of linear-regression models are well understood, and the models can be trained very quickly.

Answer 51

For each variable: Consider the number of valid cases, mean and standard deviation.
For each model: Consider regression coefficients, correlation matrix, part and partial correlations, multiple R, R², adjusted R², change in R², standard error of the estimate, analysis-of-variance table, predicted values and residuals. Also consider 95 percent confidence intervals for each regression coefficient, variance-covariance matrix, variance inflation factor, tolerance, Durbin-Watson test, distance measures (Mahalanobis, Cook and leverage values), DfBeta, DfFit, prediction intervals and case-wise diagnostic information.
Plots: Consider scatterplots, partial plots, histograms and normal probability plots.
Data: Dependent and independent variables should be quantitative. Categorical variables, such as religion, major field of study or region of residence, need to be recoded to binary (dummy) variables or other types of contrast variables.
Other assumptions: For each value of the independent variable, the distribution of the dependent variable must be normal. The variance of the distribution of the dependent variable should be constant for all values of the independent variable. The relationship between the dependent variable and each independent variable should be linear, and all observations should be independent.

Answer 52

Here’s how you can check for these assumptions:
The variables should be measured at a continuous level. Examples of continuous variables are time, sales, weight and test scores.
Use a scatterplot to find out quickly if there is a linear relationship between the two variables.
The observations should be independent of each other (that is, there should be no dependency).
Your data should have no significant outliers.
Check for homoscedasticity, a statistical concept in which the variances along the best-fit linear-regression line remain similar all along that line.
The residuals (errors) of the best-fit regression line should follow a normal distribution.

Answer 53

1 dependent variable (interval or ratio), 1 independent variable (interval or ratio or dichotomous)

Answer 54

1 dependent variable (interval or ratio) , 2+ independent variables (interval or ratio or dichotomous)

Answer 55

1 dependent variable (dichotomous), 2+ independent variable(s) (interval or ratio or dichotomous)

Answer 56

1 dependent variable (ordinal), 1+ independent variable(s) (nominal or dichotomous)

Answer 57

1 dependent variable (nominal), 1+ independent variable(s) (interval or ratio or dichotomous)

Answer 58

1 dependent variable (nominal), 1+ independent variable(s) (interval or ratio)