Outliers and z-score

Outliers in a data set can bias the mean and inflate the standard deviation, so screening the data for them is an important first step. Generally speaking, there are two ways to detect outliers: 1) graph the data with a histogram, or 2) look at z-scores.
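As a rough illustration of the first approach, the sketch below draws a histogram in base R. The `scores` vector is made-up data, not taken from any real study; a bar sitting far away from the rest of the distribution is the visual cue to look for.

```r
# Hypothetical sample: nine plausible values plus one suspicious entry
scores <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 95)

# Draw the histogram; an isolated bar far from the others suggests an outlier
h <- hist(scores, breaks = 10, main = "Histogram of scores", xlab = "Score")
```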

A z-score indicates how many standard deviations an element is from the mean. A z-score can be calculated from the following formula:

z = (X - μ) / σ

where z is the z-score, X is the value of the element, μ is the population mean, and σ is the population standard deviation.
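A small worked example may help. The values in `x` are invented for illustration and are treated as a whole population, so the standard deviation divides by n. Note that R's built-in `sd()` divides by n - 1 instead, which is the usual choice when the data are a sample.

```r
# Hypothetical values, treated here as a complete population
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mu    <- mean(x)                    # population mean: 5
sigma <- sqrt(mean((x - mu)^2))     # population standard deviation: 2

# z-score of every element
z <- (x - mu) / sigma
z   # -1.5 -0.5 -0.5 -0.5  0.0  0.0  1.0  2.0
```

The value 9, for instance, lies 2 standard deviations above the mean.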

The following are some examples of interpreting z-scores:

  1. A z-score less than 0 represents an element less than the mean.
  2. A z-score greater than 0 represents an element greater than the mean.
  3. A z-score equal to 0 represents an element equal to the mean.
  4. A z-score equal to 1 represents an element that is 1 standard deviation greater than the mean; a z-score equal to 2, 2 standard deviations greater than the mean; etc.
  5. A z-score equal to -1 represents an element that is 1 standard deviation less than the mean; a z-score equal to -2, 2 standard deviations less than the mean; etc.
  6. If the number of elements in the set is large and the data are approximately normally distributed, about 68% of the elements have a z-score between -1 and 1; about 95% have a z-score between -2 and 2; and about 99.7% have a z-score between -3 and 3.
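The percentages in the last item can be checked against the standard normal distribution with `pnorm()`. The helper name `within` is just a label chosen for this sketch.

```r
# Proportion of a standard normal distribution lying within
# k standard deviations of the mean
within <- function(k) pnorm(k) - pnorm(-k)

round(100 * within(1:3), 1)   # 68.3 95.4 99.7
```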

A z-score can be calculated for any data, but converting it into a probability assumes that the frequency distribution of the data is approximately normal. Under that assumption, standardising the scores (converting them to z-scores) lets us find the probability of a score occurring.

If the probability of a score occurring is very small, the score is likely to be an outlier.
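To make "very small" concrete, the sketch below computes the two-tailed probability of observing an absolute z-score at least as large as a given value, assuming normality. The three values are the cut-offs used in the code later in this post.

```r
# Cut-offs used by the outlier summary function in this post
z <- c(1.96, 2.58, 3.29)

# Probability of a score at least this far from the mean, in either direction
p <- 2 * pnorm(-abs(z))
round(p, 3)   # 0.050 0.010 0.001
```

In a normal sample, then, roughly 5% of cases are expected beyond 1.96, 1% beyond 2.58, and 0.1% beyond 3.29. Percentages well above these suggest outliers.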

The following R function reports the percentage of cases whose absolute z-score exceeds each of three common cut-offs:

outlierSummary<-function(variable, digits = 2){
	
	# Standardise: subtract the mean and divide by the standard deviation
	zvariable<-(variable-mean(variable, na.rm = TRUE))/sd(variable, na.rm = TRUE)
		
	# Flag cases whose absolute z-score reaches each cut-off
	outlier95<-abs(zvariable) >= 1.96
	outlier99<-abs(zvariable) >= 2.58
	outlier999<-abs(zvariable) >= 3.29
	
	# Number of non-missing cases
	ncases<-length(na.omit(zvariable))
	
	# Percentage of cases flagged at each cut-off (missing values are not counted)
	percent95<-round(100*length(subset(outlier95, outlier95 == TRUE))/ncases, digits)
	percent99<-round(100*length(subset(outlier99, outlier99 == TRUE))/ncases, digits)
	percent999<-round(100*length(subset(outlier999, outlier999 == TRUE))/ncases, digits)
	
	cat("Absolute z-score greater than 1.96 = ", percent95, "%", "\n")
	cat("Absolute z-score greater than 2.58 = ",  percent99, "%", "\n")
	cat("Absolute z-score greater than 3.29 = ",  percent999, "%", "\n")
}
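
The function above reports percentages only. To see which cases are extreme, one option is to compute the z-scores directly and list the flagged rows. This is a minimal sketch: the `scores` vector is invented for illustration, and the 3.29 cut-off is one of the three used above.

```r
# Hypothetical data: 19 values near 50 and one extreme value
scores <- c(48, 50, 51, 49, 52, 50, 47, 53, 50, 51,
            49, 50, 52, 48, 51, 50, 49, 51, 50, 120)

# Standardise using the sample mean and sample standard deviation
z <- (scores - mean(scores, na.rm = TRUE)) / sd(scores, na.rm = TRUE)

# Positions of cases whose absolute z-score exceeds the cut-off
flagged <- which(abs(z) > 3.29)

# Show the flagged cases with their raw values and z-scores
data.frame(case = flagged, value = scores[flagged], z = round(z[flagged], 2))
```

Here only case 20 (the value 120) is flagged. In small samples a single extreme value inflates the standard deviation so much that its own z-score stays modest, so this approach is more reliable with larger data sets.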
