Robert E. Fromm, Jr., M.D., M.P.H.

**Department of Medicine**

Baylor College of Medicine, Houston, Texas
USA

and

**Department of Anesthesiology & Critical Care**

The University of Texas M.D. Anderson Cancer Center, Houston, Texas, USA

Although not commonly considered a clinical subject, statistics and epidemiology form the cornerstone of clinical practice. An understanding of statistical principles is necessary to comprehend the published literature and to practice in a rational manner. The purpose of this manuscript is to review some of the basic statistical principles and formulas; more in-depth discussion can be found in biostatistics texts.

**Prevalence** is the most frequently used measure of disease frequency and is defined as:

Number of existing cases of a disease

Prevalence = ----------------------------------------------

Total population at a given point in time

**Incidence** quantifies the number of new cases of a disease that develop in a population during a given time period:

Number of new cases of a disease during a given time period

Cumulative incidence = ------------------------------------------------------------------------

Total population at risk

**Cumulative incidence** (CI) reflects the probability that an individual develops a disease during a given time period.

**Incidence density** (ID) allows one to account for varying periods of follow-up and is calculated as:

New cases of the disease during a given period of time

ID = -----------------------------------------------------------------------------------

Total person-time of observation
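The three frequency measures above are simple ratios. A minimal Python sketch (the function names and the example numbers are ours, not from the text):

```python
def prevalence(existing_cases, total_population):
    # Existing cases / total population at a given point in time.
    return existing_cases / total_population

def cumulative_incidence(new_cases, population_at_risk):
    # New cases during a period / total population at risk.
    return new_cases / population_at_risk

def incidence_density(new_cases, person_time):
    # New cases during a period / total person-time of observation.
    return new_cases / person_time

# Hypothetical town of 10,000: 50 existing cases, 20 new cases in a year
# among 9,950 initially at risk, 9,800 person-years of observation.
print(prevalence(50, 10_000))        # 0.005
print(cumulative_incidence(20, 9_950))
print(incidence_density(20, 9_800))
```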

Several special types of incidence and prevalence measures are commonly reported.

**Mortality rate** is an incidence measure:

Number of deaths

Mortality rate = ---------------------------------------------------------------------

Total population

**Case-fatality rate** is another incidence measure:

Number of deaths from the disease

Case-fatality rate = -----------------------------------------------------------

Number of cases of the disease

**Attack rate** is also an incidence measure:

Number of cases of the disease

Attack rate = -------------------------------------------------------------------

Total population at risk for a given time period

The performance of a laboratory test is commonly reported in terms of sensitivity and specificity defined as:

True positives

Sensitivity = -----------------------------------------------------------

True positives + false negatives

True negatives

Specificity = -----------------------------------------------------------

True negatives + false positives

Thus, **sensitivity** measures the proportion of people who truly have the disease who test positive, while **specificity** measures the proportion of people free of the disease who test negative.
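Both definitions follow directly from the counts of true and false results. An illustrative sketch (the counts are made up):

```python
def sensitivity(true_pos, false_neg):
    # Fraction of diseased patients the test detects.
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    # Fraction of disease-free patients the test correctly clears.
    return true_neg / (true_neg + false_pos)

# Hypothetical counts: 90 true positives, 10 false negatives,
# 80 true negatives, 20 false positives.
print(sensitivity(90, 10))  # 0.9
print(specificity(80, 20))  # 0.8
```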

These crude measurements of laboratory performance do not take into account the level at which a test is determined to be positive.

**Receiver operating characteristic (ROC) curves** examine the performance of a test throughout its range of values. An area under the ROC curve of 1.0 indicates a perfect test, while a test that is no better than flipping a coin has an area under the ROC curve of 0.5.
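The area under the ROC curve can be computed without plotting anything: it equals the probability that a randomly chosen diseased case receives a higher test value than a randomly chosen disease-free case, with ties counting one half. A brute-force sketch of that equivalence (the scores are made up):

```python
def roc_auc(diseased_scores, healthy_scores):
    # Compare every diseased/healthy pair; the fraction of pairs the
    # diseased case "wins" (ties count 0.5) is the area under the curve.
    pairs = 0
    wins = 0.0
    for d in diseased_scores:
        for h in healthy_scores:
            pairs += 1
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / pairs

print(roc_auc([3, 4, 5], [1, 2, 3]))  # well above chance
print(roc_auc([1, 2], [1, 2]))        # 0.5: no discrimination
```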

As clinicians examining a positive test, we are most interested in determining whether the patient actually has the disease.

The **positive predictive value** (PPV) provides this probability:

True positives

PPV = -----------------------------------------------------------------

True positives + false positives

Equivalently, in terms of prevalence, sensitivity, and specificity:

Prevalence x sensitivity

PPV = ------------------------------------------------------------------------

Prevalence x sensitivity + (1 - prevalence) x (1 - specificity)

**Negative predictive value** (NPV) describes the probability that a patient who tests negative truly does not have the disease:

True negatives

NPV = ------------------------------------------------------------------------

True negatives + false negatives

Equivalently:

(1 - prevalence) x specificity

NPV = ------------------------------------------------------------------------

(1 - prevalence) x specificity + prevalence x (1 - sensitivity)
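The prevalence-based forms make it easy to see how predictive values collapse at low prevalence even for a good test. An illustrative sketch (the test characteristics and prevalence are made up):

```python
def ppv(prev, sens, spec):
    # Prevalence-based form of the positive predictive value.
    return (prev * sens) / (prev * sens + (1 - prev) * (1 - spec))

def npv(prev, sens, spec):
    # Prevalence-based form of the negative predictive value.
    return ((1 - prev) * spec) / ((1 - prev) * spec + prev * (1 - sens))

# A 90%-sensitive, 80%-specific test at 1% prevalence:
print(round(ppv(0.01, 0.90, 0.80), 3))  # only about 4% of positives are true
print(round(npv(0.01, 0.90, 0.80), 3))  # but a negative result is reassuring
```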

A large collection of data cannot readily be appreciated by simple scrutiny. Summary or descriptive statistics help to describe the data succinctly. Two types of measures are usually employed: a measure of central tendency and a measure of dispersion.

**Measures of central tendency** include the mean (the arithmetic average), the median, and the mode (the most frequently occurring value).

**Median** is the middle value: one half of the data points fall below it and one half fall above.

**Measures of dispersion** include the range, interquartile range, variance, and standard deviation.

Range = Greatest value - Least value

The **interquartile range** (IQR) is the range of the middle 50% of the data.

IQR = U75 - L25

where U75 is the upper 75th percentile and L25 is the lower 25th percentile.

**Variance** is the average of the squared distances between each of the values and the mean; the **standard deviation** is the square root of the variance and is expressed in the same units as the original data.
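All of these summary statistics are available in Python's standard library. A quick sketch on a made-up sample:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # central tendency: arithmetic average
median = statistics.median(data)       # middle value
mode = statistics.mode(data)           # most frequently occurring value
data_range = max(data) - min(data)     # greatest value minus least value
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                          # range of the middle 50% of the data
variance = statistics.pvariance(data)  # mean squared distance from the mean
std_dev = statistics.pstdev(data)      # square root of the variance

print(mean, median, mode, data_range, iqr, variance, std_dev)
```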

**Hypothesis testing** involves conducting a test of statistical significance, quantifying the degree to which random variability may account for the observed results. In performing hypothesis testing, two types of error can be made:

**Type I errors** refer to a situation in which statistical significance is found when no difference actually exists. The probability of making a type I error is equal to the alpha level of the test, conventionally set at 0.05. **Type II errors** occur when no statistical significance is found even though a true difference exists; the probability of a type II error is beta, and 1 - beta is the power of the test.

The following methods are the most frequently used tests for biological data.

**Chi-square test** (χ²) is used for discrete data such as counts.

The general form of a chi-square test is:

(Observed - Expected)^2

Chi-square = Summation of -------------------------------------------------

Expected

Chi-square is commonly used in *contingency tables*:

|             | Diseased | Not Diseased | Totals        |
|-------------|----------|--------------|---------------|
| Exposed     | a        | b            | a + b         |
| Not Exposed | c        | d            | c + d         |
| Totals      | a + c    | b + d        | a + b + c + d |

**Yates correction:** When the expected value of any particular cell is less than 5, the Yates correction is used. This is calculated as:

(|Observed - Expected| - 0.5)^2

Chi-square *Yates corrected* = Summation of -------------------------------------------------

Expected
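Both the plain and Yates-corrected statistics are sums over the cells of the table. A sketch for a hypothetical 2x2 table (the counts are made up):

```python
def chi_square(observed, expected, yates=False):
    # Sum of (O - E)^2 / E over all cells; with the Yates correction,
    # the numerator becomes (|O - E| - 0.5)^2.
    total = 0.0
    for o, e in zip(observed, expected):
        diff = abs(o - e) - 0.5 if yates else o - e
        total += diff * diff / e
    return total

# Observed cells a, b, c, d = 30, 10, 20, 40; expected counts under
# independence = (row total x column total) / grand total for each cell.
observed = [30, 10, 20, 40]
expected = [20, 20, 30, 30]
print(chi_square(observed, expected))
print(chi_square(observed, expected, yates=True))
```

The correction pulls each cell's deviation toward zero, so the corrected statistic is always smaller than the uncorrected one.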

**Relative risk:** The data within a contingency table are commonly summarized in measures such as the relative risk. If we gather groups based on their exposure status, relative risk can be calculated as:

a / (a + b)

*Relative risk* = -----------------------------------------------------------

c / (c + d)

This figure represents the risk of becoming diseased if you are exposed (a/(a+b)) divided by the risk if you are not exposed (c/(c+d)), which is why it is called the relative risk. If the relative risk is 4.0, then the risk of becoming diseased if you are exposed is four times that of people who are not exposed.

**Odds ratio:** If we gather groups based on disease status, the odds ratio is calculated as an approximation to the relative risk:

a / b

*Odds ratio* = --------------------

c / d

This measure is the ratio of the odds of becoming diseased if you are exposed to the odds of becoming diseased if you are not.
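Both summary measures read straight off the contingency table cells a, b, c, d. A minimal sketch with made-up cohort counts:

```python
def relative_risk(a, b, c, d):
    # Risk in the exposed, a/(a+b), over risk in the unexposed, c/(c+d).
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    # Odds in the exposed, a/b, over odds in the unexposed, c/d.
    return (a / b) / (c / d)

# 40 of 100 exposed subjects became diseased versus 10 of 100 unexposed:
print(relative_risk(40, 60, 10, 90))  # fourfold risk in the exposed
print(odds_ratio(40, 60, 10, 90))     # larger than the relative risk here
```

Note that the odds ratio (6.0 here) only approximates the relative risk (4.0) well when the disease is rare, which it is not in this made-up cohort.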

**t-test:** Usually used to compare means between two groups.

**Normal approximation for comparing two proportions:** A method for testing whether two proportions are significantly different.

**Analysis of variance:** This method is commonly used to compare means across more than two categories.

**Regression techniques:** Generally performed with computer programs; these techniques can be used to predict a continuous variable from one or more regressors, which may be categorical, continuous, or both.

All pages copyright © Priory Lodge Education Ltd 1994, 1995, 1996.