Glossary

Outlier

Definition

An outlier is an observation that deviates markedly from other values in a dataset. Outliers may arise from genuine extreme values, measurement errors, data-entry mistakes, or technical malfunctions. Identifying outliers is a critical step in data cleaning and exploratory analysis.

Why It Matters

A single outlier can dramatically distort the mean, standard deviation, correlation, and regression coefficients. Failing to investigate outliers can lead to false conclusions or mask true patterns. Conversely, automatically discarding outliers without justification is poor practice and may bias results. The goal is to understand why an outlier exists and to handle it transparently.
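The distortion described above can be illustrated with a small sketch. The scores below are hypothetical values invented for this illustration, and the helper name summarize is likewise an assumption. The sketch uses only Python's standard statistics module and compares the mean, sample standard deviation, and median with and without one extreme value.

```python
import statistics

# Hypothetical exam scores, invented purely for illustration.
scores = [62, 68, 71, 74, 75, 78, 80, 83, 85, 90]
with_outlier = scores + [5]  # add one extreme low value


def summarize(values):
    """Return the mean, sample standard deviation, and median of values."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
    }


clean = summarize(scores)
contaminated = summarize(with_outlier)

# Print each statistic side by side so the shift is visible.
for name in ("mean", "sd", "median"):
    print(f"{name}: {clean[name]:.2f} without, {contaminated[name]:.2f} with the outlier")
```

In this made-up sample the mean falls from 76.6 to roughly 70.1 when the single low score is added, while the median moves only from 76.5 to 75. This is why the median is often described as more resistant to outliers than the mean.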

Example

A dataset of 50 students' exam scores ranges from 45 to 95, except for one score of 5. Investigation reveals that this student was absent for most of the exam due to illness. The researcher reports analyses both with and without this case and notes in the methods that the primary analysis excludes one outlier because of a documented special circumstance. This transparent approach preserves credibility.

Related Terms

Software Notes

  • SPSS: Analyze > Descriptive Statistics > Explore produces box plots that flag outliers. The interquartile range (IQR) method defines outliers as values more than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile. Graphs > Chart Builder > Boxplot visualises outliers.
  • R: boxplot(x)$out draws a box plot and returns the values plotted as outliers. boxplot.stats(x)$out returns the same values without drawing a plot. outliers::scores(x), from the outliers package, computes outlier scores using various methods.
  • Stata: graph box x produces a box plot with outliers marked as individual points. egen lower = pctile(x), p(25) and egen upper = pctile(x), p(75) store the first and third quartiles, from which the IQR and the 1.5 × IQR cut-offs can be computed for manual outlier flagging.
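
For readers working outside these packages, the 1.5 × IQR rule mentioned above can be sketched in Python. This is a minimal illustration, not a reproduction of any package's output: the function name iqr_outliers, the parameter k, and the sample scores are all assumptions made here, and the quartiles use the "inclusive" method of the standard statistics module (Python 3.8 or later). Statistical packages use differing quartile conventions, so the cut-offs may not match SPSS, R, or Stata exactly.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles.

    Returns (lower_fence, upper_fence, flagged_values).
    Quartile convention: statistics.quantiles with method="inclusive";
    other software may compute quartiles slightly differently.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr  # lower fence
    upper = q3 + k * iqr  # upper fence
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged


# Hypothetical scores, invented for illustration only.
scores = [5, 62, 68, 71, 74, 75, 78, 80, 83, 85, 90]
lower, upper, flagged = iqr_outliers(scores)
print(f"fences: [{lower}, {upper}]  flagged: {flagged}")
```

With these made-up scores the quartiles are 69.5 and 81.5, so the fences are 51.5 and 99.5 and only the score of 5 is flagged. A flag of this kind marks a case for investigation; it is not in itself a reason to delete the value.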