STA 210B/ENV 251
Statistics and Data Analysis for the Biological Sciences
On this page, I will maintain a list of terms and concepts that we are learning in this portion of the course. This is a dynamic reference to reinforce the lecture material. It will be updated frequently, with terms added as they are introduced in the course. This page may also be used as an exam review sheet.
Can't find a term? If we are using a term or concept in lecture or lab that is not in this list, email me immediately. I will include it here and review it in the following lecture. Additionally, you may refer to either of your textbooks.
Hint: Try using Find or Ctrl-F to search for terms.
Primitive Terms
Inference is the act of drawing a conclusion from, or making a decision based upon, an analysis of data.
Measurement Theory
Measurement is the process of assigning numbers or categorical labels to a perceptible property of an object.
Nominal measurement is an assignment of finite labels of exhaustive and mutually exclusive categories, especially where there is no essential ordering of the categories, e.g. Marital Status may be measured as Never Married, Married, Divorced or Separated, or Widowed. A nominal measurement is binary or dichotomous where there are only two values. When more are possible, we say it is multiple nominal or polytomous.
Ordinal measurement is an assignment to finite categories for which there is a meaningful ranking, e.g. Infant, Child, Adolescent, Adult have a natural order. Numerical values may be applied to ordinal measurements, but this is not essential, and many arithmetic properties may not be meaningful.
Numerical measurement is an assignment to numerical values for which differences or ratios of two measurements are precise and meaningful. Interval scale refers to measurements where differences are meaningful, e.g. Temperature in C or F. Ratio scale refers to meaningful ratios (zero is absolute), e.g. Temperature in Kelvin, beginning with Absolute Zero.
Measurement error is the unexplainable discrepancy between a measurement and the precise quality which the measurement instrument is intended to measure. Precision refers to both variance and bias, where variance is a measure of the random variation of measurements on objects of identical quality, e.g. a bathroom scale has greater variance than a laboratory scale, and bias is a systematic error in measurement, e.g. a bathroom scale may be out of adjustment, throwing off all subsequent measurements.
A measure of a property is valid if it is a relevant and appropriate representation of the intended property. Furthermore, a measure is said to have predictive validity if it successfully predicts outcomes which are associated with the property to be measured.
A composite measure or index is a mathematical combination of several measures or indicators, especially for the purpose of increasing validity and precision, e.g. Toxic Exposure as a sum of toxins of different species present at a specific location.
A measurement process is said to be unbiased if it does not systematically understate or overstate the true value of the quality being measured. (The word bias means slightly different things in different settings, but the essential idea is that of systematic error in valuation.)
A measurement process is said to be reliable if repeated measurements give the same, or nearly the same, results. For numerical variables, reliability is quantified by variance or some other measure of spread.
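The distinction between bias and reliability can be illustrated with a small sketch. The readings below are invented for illustration, and the helper names `bias` and `spread` are hypothetical; the sketch assumes the true value of the measured object is known, which is rarely the case outside of calibration.

```python
import statistics

# Hypothetical repeated weighings (kg) of a reference mass assumed to be 70.0 kg.
TRUE_MASS = 70.0
bathroom_scale = [71.2, 70.8, 71.5, 70.9, 71.1]
laboratory_scale = [70.01, 69.99, 70.00, 70.02, 69.98]

def bias(readings, true_value):
    """Systematic error: the average reading minus the true value."""
    return statistics.mean(readings) - true_value

def spread(readings):
    """Sample variance of repeated readings; smaller means more reliable."""
    return statistics.variance(readings)

for name, readings in [("bathroom", bathroom_scale),
                       ("laboratory", laboratory_scale)]:
    print(f"{name}: bias = {bias(readings, TRUE_MASS):+.3f} kg, "
          f"variance = {spread(readings):.5f} kg^2")
```

In this made-up example the bathroom scale is both biased (it reads about 1.1 kg high on average) and less reliable (its readings vary more) than the laboratory scale.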
For categorical variables, bias and reliability are intertwined. Nondetection or undercount is the failure to identify all members of a certain class, e.g. from a helicopter, one may fail to observe all moose within a certain area. A medical test, or some other classification procedure, may err, yielding either a false-positive or a false-negative. A false-positive is the classification of an object into a category to which it does not properly belong, e.g. a person tests positive for HIV but is not infected with HIV. A false-negative is the classification of an object into any category other than the one to which it belongs, e.g. a divorced person is falsely reported as never married or married.
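For a binary test, the two kinds of error are commonly summarized as rates. The sketch below uses invented counts, and the function names are hypothetical; it assumes the true status of every subject is known from some other source.

```python
# Hypothetical screening results cross-classified against true status.
true_positive = 90    # infected, test positive
false_negative = 10   # infected, test negative
false_positive = 50   # not infected, test positive
true_negative = 850   # not infected, test negative

def false_positive_rate(fp, tn):
    """Share of truly negative subjects that the test calls positive."""
    return fp / (fp + tn)

def false_negative_rate(fn, tp):
    """Share of truly positive subjects that the test calls negative."""
    return fn / (fn + tp)

print(f"false-positive rate: {false_positive_rate(false_positive, true_negative):.3f}")
print(f"false-negative rate: {false_negative_rate(false_negative, true_positive):.3f}")
```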
Prediction and Causal Analysis
Explanatory variable, often denoted X, is a variable or set of variables which are thought to predict the outcome or response variable, often denoted Y. Regression analysis is the modeling of how variability in Y may be explained by values of X. Analysis of variance, ANOVA, is a particular regression analysis where X is measured at the multiple nominal level, specifically to test for differences in group means.
Association and correlation refer to the coincidence of observed measurements or properties.
Fitting Y to X refers to a regression analysis which merely identifies a pattern or association in the observed data: X and Y are observed simultaneously.
Predicting y from x is the extrapolation of a regression analysis to, or an inference to, an outcome not yet observed: x is observed before y.
X causes Y is the inference that if an external and arbitrary force changes the condition of X, then the expected outcome of Y changes. This is the intervention notion of causation. Here y is not yet observed, and the value of x is not yet arbitrarily chosen. This is a much stronger inference, and more difficult to arrive at, than mere predictive inference.
The counterfactual account of causation has a slightly different
flavor. Suppose x and y are observed (the factual). If x' had instead
been chosen, then y' would have occurred (the counterfactual). One
cannot observe both (x,y) and (x',y'); one must be the counterfactual,
existing perhaps in some other metaphysical universe or grammatically in
the subjunctive mood. The causal effect of choosing x' over x is
the difference between y and the expected value of y'.
Experimental Design
Explanatory variables may be further differentiated as either a manipulated variable or an extraneous variable. The manipulated variable is the specific treatment or choice of treatment or control. The control is the treatment group to which all other treatments will be compared, often the traditional practice.
Extraneous variables, often denoted Z, are all factors which may influence the response or the choice of treatment. A factor which may influence both choice of treatment and response is called a confounding variable. The possibility of a confounding variable undermines any claim of causality since it cannot be ruled out as the cause of both the response and the explanatory variable. Some possible confounders may be measured, and some statistical techniques can make appropriate adjustments in this case. However, some possible confounders are not observable or were not observed. In this situation, the causal inference cannot be sustained on statistical grounds alone.
An experiment where all human subjects or experimental units are given the same treatment is called simple or uncontrolled. When the units are allocated into at least two treatment groups, each group receiving a different treatment, the experiment is said to be controlled. If the allocation is by random assignment, then the experiment is called randomized controlled. In any controlled experiment, one may compare the average outcomes of any two treatments (or treatment and control). The difference of these average outcomes is called an effect. In the case of a randomized controlled experiment where the causal claim is merited, effects may also be called causal effects.
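As a rough sketch of random assignment and of an effect as a difference of average outcomes, consider the following. The unit labels, outcome values, and the function names `randomize` and `effect` are all invented for illustration, and the equal split into two groups is an assumption of this sketch, not a requirement of the definition.

```python
import random
import statistics

def randomize(units, seed=0):
    """Randomly split units into a treatment group and a control group."""
    rng = random.Random(seed)   # fixed seed so the allocation can be reproduced
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def effect(treatment_outcomes, control_outcomes):
    """Difference between the average outcomes of two groups."""
    return statistics.mean(treatment_outcomes) - statistics.mean(control_outcomes)

units = [f"plant{i}" for i in range(1, 9)]
treatment_group, control_group = randomize(units, seed=1)
print("treatment:", treatment_group)
print("control:  ", control_group)

# Hypothetical outcomes measured after the experiment (e.g. growth in cm).
treatment_outcomes = [5.9, 6.1, 5.7, 6.3]
control_outcomes = [5.1, 4.8, 5.3, 5.0]
print(f"estimated effect: {effect(treatment_outcomes, control_outcomes):.2f}")
```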
When experimenting, knowledge of the assigned treatment may influence the response. Preventing a subject from knowing his or her treatment group is one type of blinding. Often a placebo, or fake treatment, is needed to maintain this blinding. Researchers who come in direct contact with the subjects or units may also need to be blinded about allocation. If both types of blinding are in practice, the experiment is said to be double-blinded.
Blocking is the grouping of experimental units according to similarity or origin, e.g. littermates or classmates. Block randomization is separate random allocation of units within each block to treatment groups.
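One way block randomization might be carried out is sketched below. The litter and animal labels are made up, `block_randomize` is a hypothetical helper, and assigning treatments in rotation after shuffling is just one simple scheme that keeps each block balanced.

```python
import random
from collections import Counter

def block_randomize(blocks, treatments, seed=0):
    """Shuffle the units within each block, then deal out treatments in turn."""
    rng = random.Random(seed)
    assignment = {}
    for units in blocks.values():
        shuffled = list(units)
        rng.shuffle(shuffled)              # separate random order per block
        for i, unit in enumerate(shuffled):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical blocks: two litters of four animals each.
blocks = {
    "litter A": ["A1", "A2", "A3", "A4"],
    "litter B": ["B1", "B2", "B3", "B4"],
}
assignment = block_randomize(blocks, ["treatment", "control"], seed=42)

for block_name, units in blocks.items():
    counts = Counter(assignment[u] for u in units)
    print(block_name, dict(counts))
```

Because the allocation is done separately within each litter, every litter contributes the same number of animals to each treatment group.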
Cross-over design is a type of experiment where experimental
units are given several treatments in succession. The order of treatments
ought to be set by some random procedure.
Describing Distributions
A measure of central tendency is a statistic which is in some sense close to most of the data. The mean is the sum of observations divided by the number of observations. The mean is sensitive to all observations, especially extreme observations. The median is a value which is greater than at most half of the observations and less than at most half of the observations. The median is not sensitive to extreme values. Modes are values which are locally most frequent, peaks of the distribution.
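The different sensitivity of the mean and the median to extreme observations can be seen in a short sketch using Python's standard `statistics` module. The data are invented. Note that `statistics.multimode` returns only the most frequent value or values, which is narrower than the "locally most frequent" notion of modes given above.

```python
import statistics

# Hypothetical observations; the last value is extreme.
data = [2, 3, 3, 4, 5, 5, 5, 6, 40]
without_extreme = data[:-1]

print("mean with extreme value:     ", round(statistics.mean(data), 3))
print("mean without extreme value:  ", statistics.mean(without_extreme))
print("median with extreme value:   ", statistics.median(data))
print("median without extreme value:", statistics.median(without_extreme))
print("most frequent value(s):      ", statistics.multimode(data))
```

Dropping the single extreme observation moves the mean from about 8.1 to about 4.1, while the median only moves from 5 to 4.5.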
If there is only one mode, the distribution is said to be unimodal. If two or more, bimodal or multimodal. If a distribution can be reflected about the center without disturbing the shape, then it is symmetric. In a symmetric distribution, the mean and median are the same. A distribution is skew if it is not symmetric; one tail will appear longer than the other. If the right tail is longer, then it is skew to the right; if the left tail is longer, it is skew to the left.
Quantiles, also percentiles, are the smallest values greater than some specified fraction of the distribution. The median is the 0.50 quantile. The first quartile (Q1) is the 25th percentile, while the third quartile (Q3) is the 0.75 quantile. The second quartile is just the median! Quintiles partition the distribution into fifths, and deciles into tenths. The maximum and minimum are the 1 and 0 quantiles, respectively. (In some hypothetical distributions, such as the normal distribution, the maximum and minimum are positive and negative infinity.) Often the presentation of the minimum, Q1, median, Q3, and maximum is called a five-number summary.
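A five-number summary might be computed as in the sketch below. The data are invented and `five_number_summary` is a hypothetical helper. Several conventions exist for computing quartiles from a sample; this sketch uses the "inclusive" interpolation method of Python's `statistics.quantiles`, which may differ slightly from the rule used in your textbook.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

# Hypothetical measurements (e.g. leaf lengths in mm).
lengths = [12, 15, 11, 19, 14, 13, 18, 16, 17]

minimum, q1, median, q3, maximum = five_number_summary(lengths)
print("five-number summary:", (minimum, q1, median, q3, maximum))
```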
A measure of spread or variability is a statistic describing how far the distribution is from its center. The range is the difference between the maximum and minimum. The inter-quartile range (IQR) is the difference between the third and first quartiles. The range is highly sensitive to extreme values while the IQR is more resistant or stable. A deviation from the mean, also a residual, is the difference between an observation and the mean. Variance is the mean of the squared deviations from the mean. If the mean of the sample is used, then for the purpose of estimating the variance, one observation is effectively lost. So the sample variance is the sum of the squared deviations from the sample mean divided by one less than the number of observations. (See textbook for formula.) The standard deviation (SD) is the square root of the variance. The standard deviation has the same units of measurement as the mean, but the variance has squared units. Variance and SD are both sensitive to extreme values.
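The verbal definition of the sample variance translates directly into code. The sketch below uses invented data and hypothetical function names, and checks the hand-rolled result against `statistics.variance` from the standard library.

```python
import math
import statistics

def sum_of_squared_deviations(data):
    """Sum of squared deviations of each observation from the sample mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)

def sample_variance(data):
    """Sum of squared deviations divided by one less than the sample size."""
    return sum_of_squared_deviations(data) / (len(data) - 1)

def sample_sd(data):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(data))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations, mean 5
print("range:          ", max(data) - min(data))
print("sum of squares: ", sum_of_squared_deviations(data))
print("sample variance:", round(sample_variance(data), 4))
print("sample SD:      ", round(sample_sd(data), 4))
print("library check:  ", round(statistics.variance(data), 4))
```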
Some extreme values are so far removed from the bulk of the observations that they do not appear to be genuine observations. Such an observation is thought to be an outlier. Outlying cases ought to be examined carefully. The explanation may be as simple as a misplaced decimal point! Similarly, one ought to examine the clusters of observations about the modes in a multimodal distribution. There may be distinct subpopulations within the data, the discovery of which may have substantial impact on analysis and interpretation.
Describing Bivariate Association
Two variables are associated, correlated or dependent when knowledge of one improves the prediction of the other. Conversely, two variables are independent, uncorrelated, or not associated when knowledge of one does not improve the prediction of the other over what one might predict without knowledge of the first.
When the two variables are binary, an appropriate measure of association is the odds ratio (OR) or log odds ratio. Using a 2X2 table, the OR may be computed as the product of the diagonal cells [ (1,1) and (2,2) ] divided by the product of the off-diagonal cells [ (1,2) and (2,1) ]. This measures how much the odds of having trait 2 given trait 1 are multiplied over the odds of trait 2 given the lack of trait 1. When OR is 1, the two traits are not associated. When OR is greater than 1, there is a positive association, i.e., trait 2 is more likely when trait 1 is present. When OR is less than 1, there is a negative association, i.e., trait 2 is less likely when trait 1 is present.
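The cross-product calculation can be sketched as follows. The counts are invented and `odds_ratio` is a hypothetical helper; rows are assumed to be "has trait 1" and "lacks trait 1", and columns "has trait 2" and "lacks trait 2". The sketch does not handle tables with a zero in an off-diagonal cell.

```python
import math

def odds_ratio(table):
    """Cross-product ratio of a 2X2 table given as [[n11, n12], [n21, n22]]."""
    (n11, n12), (n21, n22) = table
    return (n11 * n22) / (n12 * n21)

# Hypothetical counts.
table = [[30, 70],   # has trait 1:   30 with trait 2, 70 without
         [10, 90]]   # lacks trait 1: 10 with trait 2, 90 without

or_value = odds_ratio(table)
print(f"odds ratio:     {or_value:.3f}")
print(f"log odds ratio: {math.log(or_value):.3f}")

# The same number, computed as a ratio of two odds.
odds_given_trait1 = 30 / 70
odds_given_no_trait1 = 10 / 90
print(f"ratio of odds:  {odds_given_trait1 / odds_given_no_trait1:.3f}")
```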
Row frequencies are the ratios of cell counts in a contingency table to the sum of the corresponding row, row margins. Column frequencies are the ratios of cell counts to column sums or margin. Suppose the rows correspond to 'has trait 1' and 'does not have trait 1' and columns to trait 2. Then the row frequencies estimate the probability of trait 2 given knowledge of trait 1. The column frequencies estimate the probability of trait 1 given knowledge of trait 2.
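Row and column frequencies can be computed from a contingency table as in this sketch. The counts are invented and the function names are hypothetical; rows are assumed to correspond to trait 1 (present, absent) and columns to trait 2 (present, absent).

```python
def row_frequencies(table):
    """Each cell divided by its row total."""
    return [[cell / sum(row) for cell in row] for row in table]

def column_frequencies(table):
    """Each cell divided by its column total."""
    column_totals = [sum(column) for column in zip(*table)]
    return [[cell / column_totals[j] for j, cell in enumerate(row)]
            for row in table]

# Hypothetical counts.
table = [[30, 70],   # has trait 1
         [10, 90]]   # lacks trait 1

# First entry of the first row: estimated P(trait 2 | trait 1).
print("row frequencies:   ", row_frequencies(table))
# First entry of the first row: estimated P(trait 1 | trait 2).
print("column frequencies:", column_frequencies(table))
```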
When the two variables are continuous, an appropriate measure of association is Pearson's correlation coefficient. This is SPxy / (SSx * SSy)^0.5, where SPxy is the sum of products of X residuals and Y residuals, SSx is the sum of square residuals for X, and SSy is the sum of square residuals for Y. This correlation (r) is scaled so that it is always between -1 and 1. Correlation near 1 means that the variables are positively associated and with high agreement. Correlation near -1 means that the variables are negatively associated and in high agreement (up to a change of signs). Correlation near 0 indicates that there is little or no association between the variables. See Section 13.6 in Samuels for details and formulas.
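The formula r = SPxy / (SSx * SSy)^0.5 can be followed step by step in a short sketch. The data are invented and `pearson_r` is a hypothetical helper written from the definition above rather than taken from any library.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation: SPxy / sqrt(SSx * SSy)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp_xy / math.sqrt(ss_x * ss_y)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(f"r = {pearson_r(x, y):.4f}")
```

For these data SPxy = 6, SSx = 10 and SSy = 6, so r = 6 / (60)^0.5, a fairly strong positive association.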
(Splitting the data at the medians and examining the resulting 2X2 table is a test of association due to Blomqvist. Just as the IQR is a more resistant measure of spread than the standard deviation, the Blomqvist median test is a more resistant measure of association than Pearson's correlation coefficient. We may discuss other non-parametric, resistant methods later in the course. For now it is enough to know that resistant, or robust, methods are not severely affected by the presence of outliers in the data.)
Simple linear regression obtains a line which may be used to predict one variable, Y, from the other, X. The equation for a line, y = a + bx, contains two regression coefficients, the intercept a and the slope b. The slope b is estimated by the sum of products of the residuals of X and Y divided by the sum of squares of the residuals of X, that is, b = SPxy / SSx. The intercept a is estimated by the mean of Y minus slope b times the mean of X. When predicting a new value y from x, we expect y to have a mean of a + bx. We also expect that y will have a standard deviation greater than the root mean squared error (RMSE). A residual for the fitted line, where (x, y) is an observation, is e = y - a - bx. The sum of squared residuals is denoted SS(resid) or SSe; this may be computed as SSy - b*SPxy. RMSE is the square root of SSe/(n-2). (n is the number of observations, and the -2 is due to having estimated two parameters, a and b.) RMSE is an estimate of the standard deviation of observations about the regression line; hence, Samuels calls it the residual standard deviation, but RMSE is the more common name. For details and formulas, see Sections 13.1-3 in Samuels.
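The estimates described above can be traced in a short sketch. The data are invented and `fit_line` is a hypothetical helper that follows the verbal formulas directly; it assumes at least three observations and some variation in x.

```python
import math

def fit_line(x, y):
    """Least-squares line y = a + b*x; returns (a, b, SSe, RMSE)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    b = sp_xy / ss_x                    # slope
    a = mean_y - b * mean_x             # intercept
    ss_e = ss_y - b * sp_xy             # sum of squared residuals (shortcut)
    rmse = math.sqrt(ss_e / (n - 2))    # residual standard deviation
    return a, b, ss_e, rmse

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b, ss_e, rmse = fit_line(x, y)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print(f"SSe = {ss_e:.3f}, RMSE = {rmse:.3f}")

# Check the shortcut for SSe against the residuals e = y - a - b*x.
direct_ss_e = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
print(f"SSe from residuals = {direct_ss_e:.3f}")
```

For these data SPxy = 6, SSx = 10 and SSy = 6, giving b = 0.6, a = 2.2, SSe = 2.4 and RMSE = (0.8)^0.5, about 0.894.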