Outliers and z-score

Outliers in a data set can bias the mean and inflate the standard deviation, so screening the data for them is an important step. Broadly, there are two ways to detect outliers: (1) graph the data, for example with a histogram, and (2) examine the z-scores.

A z-score indicates how many standard deviations an element is from the mean. A z-score can be calculated from the following formula:

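z = (X - μ) / σ

where z is the z-score, X is the value of the element, μ is the population mean, and σ is the population standard deviation. In practice μ and σ are usually unknown and are estimated by the sample mean and sample standard deviation.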
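As a quick illustration of the formula, here is a minimal sketch in R. The numbers and variable names are made up for the example; the first part treats the mean and standard deviation as known population values, and the second estimates them from a small sample.

```r
# Hypothetical example: a score of 70 from a population with mean 50 and SD 10
X <- 70
mu <- 50      # assumed population mean
sigma <- 10   # assumed population standard deviation

z <- (X - mu) / sigma
print(z)  # 2: the score lies 2 standard deviations above the mean

# When mu and sigma are unknown, estimate them from the sample
scores <- c(40, 45, 50, 55, 60)   # made-up data
zscores <- (scores - mean(scores)) / sd(scores)
print(round(zscores, 2))
```

Standardised this way, the sample z-scores always have a mean of 0 and a standard deviation of 1.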

The following are some examples of how to interpret z-scores:

A z-score less than 0 represents an element less than the mean.
A z-score greater than 0 represents an element greater than the mean.
A z-score equal to 0 represents an element equal to the mean.
A z-score equal to 1 represents an element that is 1 standard deviation greater than the mean; a z-score equal to 2, 2 standard deviations greater than the mean; etc.
A z-score equal to -1 represents an element that is 1 standard deviation less than the mean; a z-score equal to -2, 2 standard deviations less than the mean; etc.
If the data are approximately normally distributed and the number of elements in the set is large, about 68% of the elements have a z-score between -1 and 1; about 95% have a z-score between -2 and 2; and about 99.7% have a z-score between -3 and 3.
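These percentages can be checked against the standard normal distribution. The sketch below uses R's built-in pnorm(); the helper name withinSD is hypothetical and only there for illustration.

```r
# Proportion of a standard normal distribution lying within k standard
# deviations of the mean: P(-k < Z < k) = pnorm(k) - pnorm(-k)
withinSD <- function(k) pnorm(k) - pnorm(-k)   # hypothetical helper

coverage <- sapply(1:3, withinSD)
print(round(100 * coverage, 1))  # 68.3 95.4 99.7
```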

Interpreting z-scores in this way assumes that the data are approximately normally distributed. Under that assumption, standardising the scores (converting them to z-scores) lets us find the probability of a given score occurring.

If the probability of a score occurring is very small, the score is likely to be an outlier.
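To make the link between a z-score and its probability concrete, the sketch below computes the two-tailed probability of a z-score at least as extreme as the one observed, assuming normally distributed scores. The values 1.96, 2.58 and 3.29 are the conventional cut-offs corresponding to roughly 5%, 1% and 0.1%.

```r
# Two-tailed probability of a z-score at least as extreme as the one observed,
# assuming the scores are normally distributed
z <- c(1.96, 2.58, 3.29)
p <- 2 * pnorm(-abs(z))
print(round(p, 3))  # 0.050 0.010 0.001
```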

The following R function can be used to screen for outliers by reporting the percentage of cases whose absolute z-score exceeds each of three cut-offs:

outlierSummary <- function(variable, digits = 2) {
	# Standardise: subtract the mean and divide by the standard deviation
	zvariable <- (variable - mean(variable, na.rm = TRUE)) / sd(variable, na.rm = TRUE)

	# Flag cases whose absolute z-score exceeds each cut-off
	outlier95 <- abs(zvariable) > 1.96
	outlier99 <- abs(zvariable) > 2.58
	outlier999 <- abs(zvariable) > 3.29

	# Number of non-missing cases
	ncases <- length(na.omit(zvariable))

	# Percentage of non-missing cases beyond each cut-off
	percent95 <- round(100 * sum(outlier95, na.rm = TRUE) / ncases, digits)
	percent99 <- round(100 * sum(outlier99, na.rm = TRUE) / ncases, digits)
	percent999 <- round(100 * sum(outlier999, na.rm = TRUE) / ncases, digits)

	cat("Absolute z-score greater than 1.96 = ", percent95, "%", "\n")
	cat("Absolute z-score greater than 2.58 = ", percent99, "%", "\n")
	cat("Absolute z-score greater than 3.29 = ", percent999, "%", "\n")
}
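
The summary above reports only percentages. To see which cases are responsible, a small companion function can return them directly. The sketch below is one possible way to do this; the function name flagOutliers, its cutoff argument and the data are all hypothetical.

```r
# Hypothetical helper: return the cases whose absolute z-score exceeds a cut-off
flagOutliers <- function(variable, cutoff = 3.29) {
	z <- (variable - mean(variable, na.rm = TRUE)) / sd(variable, na.rm = TRUE)
	flagged <- which(abs(z) > cutoff)   # which() ignores missing values
	data.frame(index = flagged, value = variable[flagged], z = round(z[flagged], 2))
}

# Made-up data: 19 values close to 20 plus one extreme value
scores <- c(18, 19, 20, 21, 22, 19, 20, 21, 20, 18,
            22, 21, 19, 20, 20, 21, 19, 20, 22, 60)
print(flagOutliers(scores, cutoff = 3.29))
```

With these data only the last case (the value 60) is flagged. Note that a single extreme value also inflates the standard deviation used to compute its own z-score, so in very small samples even a gross outlier may not exceed a strict cut-off.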
