**AP Statistics Glossary**

**Alternative hypothesis**—the theory that the researcher hopes to confirm by rejecting the null hypothesis

**Association**—when some of the variability in one variable can be accounted for by the other

**Bar chart**—graph in which the frequencies or relative frequencies of categories are displayed with bars

**Bimodal**—distribution with two most common values; see **mode**

**Binomial distribution**—probability distribution for a random variable *X* in a binomial setting; $P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}$, where *n* is the number of independent trials, *p* is the probability of success on each trial, and *x* is the count of successes out of the *n* trials
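As a rough illustration, the sketch below evaluates the usual binomial formula, P(X = x) = C(n, x) · p^x · (1 − p)^(n − x), in Python. The function name `binomial_pmf` and the coin-flip numbers are assumptions made up for this example.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Return P(X = x) for a binomial random variable with n trials and success probability p."""
    if not 0 <= x <= n:
        return 0.0  # counts outside 0..n are impossible
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Hypothetical example: exactly 2 heads in 5 flips of a fair coin.
print(binomial_pmf(2, 5, 0.5))  # 10 * 0.25 * 0.125 = 0.3125
```

Summing `binomial_pmf(k, n, p)` over k = 0, ..., n should give 1, which is a quick sanity check on any binomial calculation.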

**Binomial setting (experiment)**—when each of a fixed number, *n*, of observations either succeeds or fails, independently, with probability *p*

**Bivariate data**—having to do with two variables

**Block**—a group of experimental units thought to be homogeneous with respect to the response variable

**Block design**—procedure by which experimental units are put into homogeneous groups (blocks) in an attempt to reduce the effect of group-to-group variability on the response variable

**Blocking**—see **block design**

**Boxplot (box-and-whisker plot)**—graphical representation of the five-number summary of a dataset. Each value in the five-number summary is located over its corresponding value on a number line. A box is drawn from Q1 to Q3, and “whiskers” extend from Q1 down to the minimum value and from Q3 up to the maximum value.

**Categorical data**—data whose values range over categories rather than numerical values. See also **qualitative data**

**Census**—attempt to contact every member of a population

**Center**—the “middle” of a distribution; either the mean or the median

**Central limit theorem**—theorem that states that the sampling distribution of a sample mean becomes approximately normal when the sample size is large
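A small simulation can make the central limit theorem concrete. The sketch below assumes a strongly right-skewed population (exponential with mean 1 and standard deviation 1), chosen only for illustration; the function name, sample size, and seed are likewise hypothetical.

```python
import random
from statistics import mean, stdev


def simulate_sample_means(sample_size: int, num_samples: int, seed: int = 0) -> list[float]:
    """Draw many samples from an exponential(1) population and return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        mean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


means = simulate_sample_means(sample_size=40, num_samples=2000)
# The sample means should center near the population mean (1) with a spread
# near sigma / sqrt(n) = 1 / sqrt(40), even though the population is skewed.
print(round(mean(means), 3), round(stdev(means), 3))
```

A histogram of `means` would look roughly mound-shaped, unlike a histogram of the individual exponential values.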

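**Chi-square ($\chi^2$) goodness-of-fit test**—compares a set of observed categorical values (counts) to a set of expected values under a set of hypothesized proportions for the categories; $\chi^2 = \sum \frac{(O - E)^2}{E}$, where *O* is an observed count and *E* is the corresponding expected count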
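The goodness-of-fit statistic is usually computed as the sum of (observed − expected)² / expected over the categories. The sketch below shows that arithmetic; the function name and the die-rolling counts are invented for the example.

```python
def chi_square_statistic(observed: list[float], expected: list[float]) -> float:
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of categories")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die hypothesized to be fair,
# so each of the 6 faces has an expected count of 60 * (1/6) = 10.
observed = [8, 12, 9, 11, 6, 14]
expected = [10] * 6
print(chi_square_statistic(observed, expected))  # compare to chi-square with 6 - 1 = 5 df
```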

**Cluster sample**—sample in which the population is first divided into sections or “clusters”; one or more entire clusters are then randomly selected, and all of the members of the selected cluster(s) are included in the sample

**Coefficient of determination ($r^2$)**—measures the proportion of variation in the response variable explained by regression on the explanatory variable

**Complement of an event**—set of all outcomes in the sample space that are not in the event

**Completely randomized design**—when all subjects (or experimental units) are randomly assigned to treatments in an experiment

**Conditional probability**—the probability of one event occurring given that some other event has already occurred

**Confidence interval**—an interval that, with a given level of confidence, is likely to contain a population value; (estimate) ± (margin of error)

**Confidence level**—the probability that the procedure used to construct an interval will generate an interval that does contain the population value

**Confounding variable**—variable that has an effect on the outcomes of the study but whose effects cannot be separated from those of the treatment variable

**Contingency table**—see **two-way table**

**Continuous data**—data that can be measured, or take on values in an interval; the set of possible values cannot be counted

**Continuous random variable**—a random variable whose values are continuous data; takes all values in an interval

**Control**—see **statistical control**

**Convenience sample**—sample chosen without any random mechanism; chooses individuals based on ease of selection

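**Correlation coefficient (*r*)**—measures the strength and direction of the linear relationship between two quantitative variables; $r = \frac{1}{n-1}\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$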
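One common way to compute *r* is to average the products of the standardized x- and y-values, dividing by n − 1. The sketch below assumes that formula; the function name and data are hypothetical.

```python
from statistics import mean, stdev


def correlation(xs: list[float], ys: list[float]) -> float:
    """Sample correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data lying exactly on an increasing line, so r should be 1
# (up to floating-point rounding).
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
```

Squaring the result gives the coefficient of determination for a simple linear regression on the same data.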

**Correlation is not causation**—just because two variables correlate strongly does not mean that one caused the other

**Critical value**—value in a distribution that marks off a specified area of the distribution

**Degrees of freedom**—number of independent datapoints in a distribution

**Density function**—a function that is everywhere nonnegative and has a total area equal to 1 underneath it and above the horizontal axis

**Descriptive statistics**—process of examining data analytically and graphically

**Dimension**—size of a two-way table; *r* × *c*

**Discrete data**—data that can be counted (possibly infinite) or placed in order

**Discrete random variable**—random variable whose values are discrete data

**Dotplot**—graph in which data values are identified as dots placed above their corresponding values on a number line

**Double blind**—experimental design in which neither the subjects nor the study administrators know what treatment a subject has received

**Empirical Rule (68-95-99.7 Rule)**—states that, in a normal distribution, about 68% of the terms are within one standard deviation of the mean, about 95% are within two standard deviations, and about 99.7% are within three standard deviations
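The three percentages in the Empirical Rule can be checked against the normal curve directly. This sketch uses `NormalDist` from Python's standard library; the helper name `proportion_within` is an assumption for the example.

```python
from statistics import NormalDist


def proportion_within(k: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Area under the normal curve within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    # Expect roughly 0.68, 0.95, and 0.997.
    print(k, round(proportion_within(k), 4))
```

The result does not depend on the particular mean and standard deviation, which is why the rule applies to every normal distribution.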

**Estimate**—sample value used to approximate a value of a parameter

**Event**—in probability, a subset of a sample space; a set of one or more simple outcomes

**Expected value**—mean value of a discrete random variable
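For a discrete random variable, the expected value is typically computed as the sum of each value times its probability. The sketch below assumes the distribution is given as a dictionary of value-probability pairs; that representation and the fair-die example are choices made for illustration.

```python
from fractions import Fraction


def expected_value(distribution: dict) -> Fraction:
    """Sum of value * probability for a discrete probability distribution."""
    total = sum(distribution.values())
    if abs(total - 1) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in distribution.items())


# Hypothetical example: one roll of a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(die))  # 7/2
```

Note that 3.5 is not a value the die can show; the expected value is a long-run average, not a predicted outcome.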

**Experiment**—study in which a researcher measures the responses to a treatment variable, or variables, imposed and controlled by the researcher

**Experimental units**—individuals on which experiments are conducted

**Explanatory variable**—explains changes in response variable; treatment variable; independent variable

**Extrapolation**—predictions about the value of a variable based on the value of another variable outside the range of measured values

**First quartile**—the value which identifies the 25th percentile; the value that has at least 25% of the data at or below it and at least 75% of the data at or above it.

**Five-number summary**—for a dataset, [minimum value, Q1, median, Q3, maximum value]
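Quartile conventions differ between textbooks and software. The sketch below assumes one common classroom convention: Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the overall median left out of both halves when the number of values is odd. The function name is hypothetical.

```python
from statistics import median


def five_number_summary(data: list[float]) -> tuple:
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves convention."""
    values = sorted(data)
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 data values")
    half = n // 2
    lower = values[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = values[half + 1:] if n % 2 else values[half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


minimum, q1, med, q3, maximum = five_number_summary([7, 1, 3, 5, 2, 6, 4])
print(minimum, q1, med, q3, maximum)  # 1 2 4 6 7
print("IQR:", q3 - q1)                # 4
```

Other conventions, such as the defaults in many statistics packages, can give slightly different quartiles for small datasets.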

**Geometric setting**—independent observations, each of which succeeds or fails with the same probability *p*; number of trials needed until first success is variable of interest

**Histogram**—graph in which numerical data are grouped into intervals and the frequencies or relative frequencies within each interval are displayed with bars

**Homogeneity of proportions**—chi-square hypothesis in which proportions of a categorical variable are tested for homogeneity across two or more populations

**Independent events**—knowing one event occurs does not change the probability that the other occurs; *P*(A|B) = *P*(A), or equivalently *P*(A and B) = *P*(A) · *P*(B)

**Independent variable**—see **explanatory variable**

**Inferential statistics**—use of sample data to make inferences about populations

**Influential observation**—observation, usually in the *x* direction, whose removal would have a marked impact on the slope of the regression line

**Interpolation**—predictions about the value of a variable based on the value of another variable within the range of measured values

**Interquartile range**—value of the third quartile minus the value of the first quartile; contains middle 50% of the data

**Least-squares regression line**—of all possible lines, the line that minimizes the sum of squared errors (residuals) from the line
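The line that minimizes the sum of squared residuals has a closed form: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², and intercept = ȳ − slope · x̄. The sketch below applies those formulas; function names and data are made up for the example.

```python
from statistics import mean


def least_squares_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Return (intercept, slope) of the least-squares line predicting y from x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs: list[float], ys: list[float]) -> list[float]:
    """Actual y minus predicted y for each point."""
    a, b = least_squares_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data that do not fall exactly on a line.
xs, ys = [1, 2, 3], [2, 2, 4]
print(least_squares_line(xs, ys))
print(residuals(xs, ys))  # least-squares residuals sum to (about) zero
```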

**Line of best fit**—see **least-squares regression line**

**Lurking variable**—one that has an effect on the outcomes of the study but whose influence was not part of the investigation

**Margin of error**—measure of uncertainty in the estimate of a parameter; (critical value) · (standard error)
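To show how (critical value) · (standard error) turns into an interval, the sketch below builds a one-proportion *z*-interval. The choice of a proportion, the function name, and the survey counts are assumptions for illustration; other parameters use different critical values and standard errors.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return (lower, upper) for estimate +/- (critical value) * (standard error)."""
    p_hat = successes / n                                    # point estimate
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    margin_of_error = z_star * standard_error
    return p_hat - margin_of_error, p_hat + margin_of_error


# Hypothetical survey: 50 of 100 respondents answer "yes".
low, high = proportion_confidence_interval(50, 100)
print(round(low, 3), round(high, 3))
```

Raising the confidence level increases the critical value and therefore widens the interval; increasing *n* shrinks the standard error and narrows it.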

**Marginal totals**—row and column totals in a two-way table

**Matched pairs**—experimental units paired by a researcher based on some common characteristic or characteristics

**Matched pairs design**—experimental design that utilizes each pair as a block; one unit receives one treatment, and the other unit receives the other treatment

**Mean**—sum of all the values in a dataset divided by the number of values

**Median**—halfway through an ordered dataset, below and above which lies an equal number of data values; 50th percentile

**Mode**—most common value in a distribution

**Mound-shaped (bell-shaped)**—distribution in which data values tend to cluster about the center of the distribution; characteristic of a normal distribution

**Mutually exclusive events**—events that cannot occur simultaneously; if one occurs, the other doesn’t

**Negatively associated**—larger values of one variable are associated with smaller values of the other; see **association**

**Nonresponse bias**—occurs when subjects selected for a sample do not respond

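**Normal curve**—familiar bell-shaped density curve; symmetric about its mean; defined in terms of its mean $\mu$ and standard deviation $\sigma$; $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$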

**Normal distribution**—distribution of a random variable *X* so that *P*(*a* < *X* < *b*) is the area under the normal curve between *a* and *b*

**Null hypothesis**—hypothesis being tested; usually a statement that there is no effect or no difference between treatments; what a researcher wants to disprove in order to support the alternative hypothesis

**Numerical data**—see **quantitative data**

**Observational study**—when variables of interest are observed and measured but no treatment is imposed in an attempt to influence the response

**Observed values**—counts of outcomes in an experiment or study; compared with expected values in a chi-square analysis

**One-sided alternative**—alternative hypothesis that varies from the null in only one direction

**One-sided test**—used when an alternative hypothesis states that the true value is either less than or greater than the hypothesized value but not both

**Outcome**—simple event in a probability experiment; a single possible result

**Outlier**—a data value that is far removed from the general pattern of the data

***P*(A and B)**—probability that *both* A and B occur; *P*(A and B) = *P*(A) · *P*(B|A)

***P*(A or B)**—probability that *either* A or B occurs; *P*(A or B) = *P*(A) + *P*(B) – *P*(A and B)

***P*-value**—probability, assuming the null hypothesis is true, of getting by chance alone a sample value at least as extreme as the one actually obtained

**Parameter**—measure that describes a population

**Percentile rank**—proportion of terms in the distribution less than the value being considered

**Placebo**—an inactive procedure or treatment

**Placebo effect**—effect, often positive, attributable to the patient’s expectation that the treatment will have an effect

**Point estimate**—value based on sample data that represents a likely value for a population parameter

**Positively associated**—larger values of one variable are associated with larger values of the other; see **association**

**Power of a test**—probability of correctly rejecting a false null hypothesis when a specific alternative is true; 1 – *P*(Type II error)

**Probability distribution**—identification of the outcomes of a random variable together with the probabilities associated with those outcomes

**Probability histogram**—histogram for a probability distribution; horizontal axis shows the outcomes, vertical axis shows the probabilities of those outcomes

**Probability of an event**—relative frequency of the number of ways an event can succeed to the total number of ways it can succeed or fail

**Probability sample**—sampling technique that uses a random mechanism to select the members of the sample

**Proportion**—ratio of the count of a particular outcome to the total number of outcomes

**Qualitative data**—data whose values range over categories rather than numerical values. See also **categorical data**

**Quantitative data**—data whose values are numerical

**Quartiles**—25th, 50th, and 75th percentiles of a dataset

**Random phenomenon**—unclear how any one trial will turn out, but there is a regular distribution of outcomes in a large number of trials

**Random sample**—sample in which each member of the sample is chosen by chance and each member of the population has an equal chance to be in the sample

**Random variable**—numerical outcome of a random phenomenon (random experiment)

**Randomization**—random assignment of experimental units to treatments

**Range**—difference between the maximum and minimum values of a dataset

**Replication**—repetition of each treatment enough times to help control for chance variation

**Representative sample**—sample that possesses the essential characteristics of the population from which it was taken

**Residual**—in a regression, the actual value minus the predicted value

**Resistant statistic**—one whose numerical value is not influenced by extreme values in the dataset

**Response bias**—bias that stems from respondents’ inaccurate or untruthful response

**Response variable**—measures the outcome of a study

**Robust**—when a procedure may still be useful even if the conditions needed to justify it are not completely satisfied

**Robust procedure**—procedure that still works reasonably well even if the assumptions needed for it are violated; the *t*-procedures are robust against the assumption of normality as long as there are no outliers or severe skewness.

**Sample space**—set of all possible mutually exclusive outcomes of a probability experiment

**Sample survey**—using a sample from a population to obtain responses to questions from individuals

**Sampling distribution of a statistic**—distribution of all possible values of a statistic for samples of a given size

**Sampling frame**—list of experimental units from which the sample is selected

**Scatterplot**—graphical representation of a set of ordered pairs; horizontal axis is first element in the pair, vertical axis is the second

**Shape**—geometric description of a dataset: mound-shaped, symmetric, uniform, skewed, etc.

**Significance level (**α**)**—probability value that, when compared to the *P*-value, determines whether a finding is statistically significant

**Simple random sample (SRS)**—sample in which all possible samples of the same size are equally likely to be the sample chosen

**Simulation**—random imitation of a probabilistic situation

**Skewed**—distribution that is asymmetrical with data bunched at one end and a tail stretching out in the other

**Skewed left (right)**—asymmetrical with more of a tail on the left (right) than on the right (left)

**Spread**—variability of a distribution

**Standard deviation**—square root of the variance; for a sample, $s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$

**Standard error**—estimate, based on sample data, of the standard deviation of the sampling distribution of a statistic

**Standard normal distribution**—normal distribution with a mean of 0 and a standard deviation of 1

**Standard normal probability**—normal probability calculated from the standard normal distribution

**Statistic**—measure that describes a sample (e.g., sample mean)

**Statistical control**—holding constant variables in an experiment that might affect the response but are not one of the treatment variables

**Statistically significant**—a finding that is unlikely to have occurred by chance

**Statistics**—science of data

**Stemplot (stem-and-leaf plot)**—graph in which numerical data are broken into “stems” and “leaves”; visually similar to a histogram except that all the data are retained

**Stratified random sample**—groups of interest (strata) chosen in such a way that they appear in approximately the same proportions in the sample as in the population

**Subjects**—human experimental units

**Survey**—obtaining responses to questions from individuals

**Symmetric**—data values distributed equally above and below the center of the distribution

**Systematic bias**—the mean of the sampling distribution of a statistic does not equal the population parameter being estimated; see **unbiased estimate**

**Systematic sample**—probability sample in which one of the first *n* subjects is chosen at random for the sample and then each *n*th person after that is chosen for the sample

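***t*-distribution**—the distribution, with *n* – 1 degrees of freedom, of the *t* statistic

***t* statistic**—$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, where $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized population mean, *s* is the sample standard deviation, and *n* is the sample size

**Test statistic**—value computed from sample data and used to decide whether to reject the null hypothesis; typically (estimate – hypothesized value) / (standard error)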
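As an example of a test statistic, the sketch below computes a one-sample *t* statistic in the usual form (sample mean − hypothesized mean) / (s / √n), with n − 1 degrees of freedom. The function name and the four data values are hypothetical, and finding the *P*-value would still require a *t* table or statistical software.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t_statistic(data: list[float], mu_0: float) -> tuple[float, int]:
    """Return (t, degrees of freedom) for testing a hypothesized mean mu_0."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least 2 observations")
    x_bar = mean(data)
    standard_error = stdev(data) / sqrt(n)  # stdev uses the n - 1 divisor
    return (x_bar - mu_0) / standard_error, n - 1


# Hypothetical sample and hypothesized mean.
t, df = one_sample_t_statistic([2, 4, 6, 8], mu_0=3)
print(round(t, 3), df)
```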

**Third quartile**—the value which identifies the 75th percentile; the value that has at least 75% of the data at or below it and at least 25% of the data at or above it.

**Treatment variable**—see **explanatory variable**

**Tree diagram**—graphical technique for showing all possible outcomes in a probability experiment

**Two-sided alternative**—alternative hypothesis that can vary from the null in either direction; values much greater than or much less than the null provide evidence against the null

**Two-sided test**—a hypothesis test with a **two-sided alternative**

**Two-way table**—table that lists the outcomes of two categorical variables; the values of one variable are given as the rows, and the values of the other variable are given as the columns; also called a contingency table

**Type I error**—the error made when a true null hypothesis is rejected

**Type II error**—the error made when a false null hypothesis is not rejected

**Unbiased estimate**—mean of the sampling distribution of the estimate equals the parameter being estimated

**Undercoverage**—some groups in a population are not included in a sample from that population

**Uniform**—distribution in which all data values have the same frequency of occurrence

**Univariate data**—having to do with a single variable

**Variance**—average of the squared deviations of a set of observations from their mean; for a sample, $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$
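The sketch below computes a sample variance and standard deviation step by step. It assumes the sample version of the formula, which divides the sum of squared deviations by n − 1 rather than n; the function names and data are invented for the example.

```python
from math import sqrt
from statistics import mean


def sample_variance(data: list[float]) -> float:
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least 2 observations")
    x_bar = mean(data)
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)


def sample_standard_deviation(data: list[float]) -> float:
    """Square root of the sample variance."""
    return sqrt(sample_variance(data))


# Hypothetical data: mean is 5, squared deviations are 9, 1, 1, 9.
values = [2, 4, 6, 8]
print(sample_variance(values), sample_standard_deviation(values))
```

Python's built-in `statistics.variance` and `statistics.stdev` use the same n − 1 divisor and should agree with these functions.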

**Voluntary response bias**—bias inherent when people choose to respond to a survey or poll; bias is typically toward opinions of those who feel most strongly

**Voluntary response sample**—sample in which participants are free to respond or not to a survey or a poll

**Wording bias**—creation of response bias attributable to the phrasing of a question

***z*-score**—number of standard deviations a term is above or below the mean; $z = \frac{x - \bar{x}}{s}$ for a sample, or $z = \frac{x - \mu}{\sigma}$ for a population
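A *z*-score is conventionally found by subtracting the mean from the value and dividing by the standard deviation. The sketch below shows that calculation; the function name and the test-score numbers are hypothetical.

```python
def z_score(value: float, mean: float, standard_deviation: float) -> float:
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    if standard_deviation <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / standard_deviation


# Hypothetical example: a score of 85 on a test with mean 70 and standard deviation 10.
print(z_score(85, 70, 10))  # 1.5
```

A negative result means the value is below the mean, and a result of 0 means it equals the mean.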