What are the advantages and limitations of the range as a measure of dispersion

Measures of Dispersion

  1. Range
  2. Variance
  3. Standard Deviation
  4. IQR (Inter Quartile Range)
  5. Skewness
  6. Kurtosis

Range: It is a measure of how spread apart the values in a data set are. It is calculated as (highest value - lowest value) of the variable.

Range = Max - Min


2. Variance

Variance is calculated by taking the differences between each number in the data set and the mean, then squaring the differences to make them positive, and finally dividing the sum of the squares by the number of values in the data set.


For any sample, the sum of the deviations from the mean (average) is always equal to 0. This is one of the constraints we have on any sample data.

Variance is a measure that quantifies the degree of dispersion of the observations around the mean value.
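The constraint above, and the reason the deviations are squared, can be seen in a short sketch. The sample values are made up for illustration and do not come from the article.

```python
from statistics import mean

# Illustrative sample (hypothetical values, not from the article).
sample = [4, 8, 15, 16, 23, 42]

m = mean(sample)                          # 18
deviations = [x - m for x in sample]      # signed distances from the mean
squared = [d ** 2 for d in deviations]    # squaring removes the sign

sum_of_deviations = sum(deviations)       # positive and negative deviations cancel
sum_of_squares = sum(squared)             # stays positive unless all values are equal

print(sum_of_deviations)   # 0
print(sum_of_squares)      # 910
```

Because the plain deviations always cancel, they say nothing about spread; the squared deviations are what the variance is built from.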

Advantages and Disadvantages of Variance

Advantages:

a. Statisticians use variance to see how individual numbers relate to each other within a data set, rather than using broader mathematical techniques such as arranging numbers into quartiles.

b. The advantage of variance is that it treats all deviations from the mean the same regardless of their direction. Because the deviations are squared, they cannot cancel out and sum to zero, which would give the appearance of no variability at all in the data.

Disadvantages:

a. One drawback of variance is that it gives added weight to outliers, the numbers that are far from the mean. Squaring these large deviations can skew the result.

b. Another drawback of variance is that it is not easily interpreted, because it is expressed in squared units. Users of variance often employ it primarily in order to take the square root of its value, which gives the standard deviation of the data set.

Degrees of Freedom

The degrees of freedom (df) of an estimate is the number of independent pieces of information that went into calculating the estimate. It’s not quite the same as the number of items in the sample. In order to get the df for the estimate, you have to subtract 1 from the number of items. Let’s say you were finding the mean weight loss for a low-carb diet. You could use 4 people, giving 3 degrees of freedom (4 - 1 = 3), or you could use one hundred people with df = 99.

n = number of items in your set.

Degrees of Freedom = n - 1

Degree of Freedom for Population

Consider a population of size N. There are no constraints on a population, so its degrees of freedom remain N.

Degree of Freedom for Sample

Consider a sample of size n. There is always a constraint on every sample, i.e. the sum of the deviations from the mean = 0. So the maximum degrees of freedom for any sample is (n - 1).

n = 1000

df = (1000 - 1)

df = 999
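One way to see why a sample loses one degree of freedom is sketched below, using the 4-person case from the low-carb example. The mean and the individual values are hypothetical: once the mean is fixed, only n - 1 values can be chosen freely, and the last one is forced.

```python
# Hypothetical sample of n = 4 weight-loss values with a known mean.
n = 4
known_mean = 5.0
free_values = [3.0, 6.0, 4.0]    # n - 1 values can be chosen freely

# The last value is forced by the constraint sum(x) == n * mean.
last_value = n * known_mean - sum(free_values)

sample = free_values + [last_value]
degrees_of_freedom = n - 1

print(last_value)            # 7.0
print(degrees_of_freedom)    # 3
```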

Variance of Population:

Population variance (σ²) tells us how data points in a specific population are spread out. It is the average of the squared distances from each data point in the population to the mean.

σ² = Σ (xᵢ - μ)² / N, where μ is the population mean and N is the population size.

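Variance of Sample:

The sample variance is defined in the same way, as the average of the squared differences from the mean, except that the sum is divided by the degrees of freedom (n - 1) rather than by n. In order to understand what you are calculating with the variance, break it down into steps: Step 1: Calculate the mean (for example, the average weight). Step 2: Subtract the mean from each value and square the result. Step 3: Add up the squared differences and divide by (n - 1).

s² = Σ (xᵢ - x̄)² / (n - 1), where x̄ is the sample mean and n is the sample size.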
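The difference between the two definitions is only the divisor. A minimal sketch using Python's standard `statistics` module, with made-up values:

```python
from statistics import pvariance, variance

# Illustrative values (an assumption, not data from the article).
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = pvariance(data)    # population variance: divides by N
samp_var = variance(data)    # sample variance: divides by n - 1

# The same two numbers computed directly from the definitions.
mean = sum(data) / len(data)
sum_sq = sum((x - mean) ** 2 for x in data)
manual_pop = sum_sq / len(data)
manual_samp = sum_sq / (len(data) - 1)

print(pop_var, samp_var)
```

The sample variance is always the larger of the two for the same data, and the gap shrinks as n grows.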

Standard Deviation:

The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean. It is calculated as the square root of the variance, which is built from each data point’s deviation from the mean. If the data points are further from the mean, there is a higher deviation within the data set; thus, the more spread out the data, the higher the standard deviation.

s = √( Σ (xᵢ - x̄)² / (n - 1) )

Here are the steps to calculate the standard deviation:
1. Compute the mean.
2. For each data value, calculate its deviation from the mean. The deviation from the mean is determined by subtracting the mean from the data value. Note that if we added all these deviations from the mean for one dataset, the sum would be 0 (or close, depending on round-off error).
3. Square each deviation from the mean.
4. Sum the squares of the deviations.
5. Divide the sum in #4 by (n - 1). Note that the text says, “there are important statistical reasons we divide by one less than the number of data values.”
6. Take the square root of the value in #5, which will give the standard deviation.
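
The six steps above can be followed literally in code. This is a minimal sketch; the function name `sample_std_dev` and the input values are chosen for illustration only.

```python
import math

def sample_std_dev(values):
    """Sample standard deviation, following the six steps one by one.

    Needs at least two values, because step 5 divides by (n - 1).
    """
    n = len(values)
    mean = sum(values) / n                     # step 1: compute the mean
    deviations = [x - mean for x in values]    # step 2: deviation of each value
    squared = [d ** 2 for d in deviations]     # step 3: square each deviation
    total = sum(squared)                       # step 4: sum the squares
    variance = total / (n - 1)                 # step 5: divide by (n - 1)
    return math.sqrt(variance)                 # step 6: take the square root

result = sample_std_dev([1, 2, 3, 4, 5])
print(result)    # square root of 2.5, roughly 1.58
```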

Now, let’s look at an example where standard deviation helps explain the data.

Consider the following three datasets:
(1) 5, 25, 25, 25, 25, 25, 45
(2) 5, 15, 20, 25, 30, 35, 45
(3) 5, 5, 5, 25, 45, 45, 45

The mean, median, and range are all the same for these datasets, but the variability of each dataset is quite different.
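
A quick check of the three datasets with Python's standard `statistics` module shows where they differ. `stdev` is the sample standard deviation, dividing by (n - 1) as in the steps above.

```python
from statistics import mean, median, stdev

datasets = {
    1: [5, 25, 25, 25, 25, 25, 45],
    2: [5, 15, 20, 25, 30, 35, 45],
    3: [5, 5, 5, 25, 45, 45, 45],
}

summary = {}
for label, values in datasets.items():
    summary[label] = {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "stdev": round(stdev(values), 2),
    }
    print(label, summary[label])

# Every dataset has mean 25, median 25 and range 40,
# but the standard deviations are 11.55, 13.23 and 20.0.
```

Dataset (1) is tightly packed around 25, while dataset (3) has most of its values at the two extremes, which the standard deviation picks up and the range does not.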

In the process of variable selection, we can look for variables whose standard deviation is equal to 0. Such a variable takes the same value for every observation, so it cannot help explain the outcome and can be ignored as an independent variable.

Note: When the client insists on keeping all the variables they consider important, we cannot simply drop such variables even though their standard deviation is 0. In such cases we might have to add systematic noise to the variables whose standard deviation is 0.
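
Finding such variables is straightforward. The sketch below uses a small hypothetical feature table (the column names and values are invented) and flags every column whose standard deviation is 0.

```python
from statistics import pstdev

# Hypothetical feature table: column name -> observed values.
features = {
    "age": [25, 32, 47, 51],
    "country_code": [1, 1, 1, 1],    # constant, so its standard deviation is 0
    "income": [30_000, 42_000, 58_000, 61_000],
}

# A column with zero standard deviation has the same value in every row.
constant_columns = [name for name, values in features.items() if pstdev(values) == 0]
kept_columns = [name for name in features if name not in constant_columns]

print(constant_columns)    # ['country_code']
print(kept_columns)        # ['age', 'income']
```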

IQR (Inter Quartile Range)

The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles. The values that divide each part are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.

  • Q1 is the “middle” value in the first half of the rank-ordered data set.
  • Quartile 1 : 25th percentile
  • Q2 is the median value in the set.
  • Quartile 2 : 50th percentile
  • Q3 is the “middle” value in the second half of the rank-ordered data set.
  • Quartile 3: 75th percentile

Box Plots

Box plots are a visual representation of data that can help us find Q1, Q2 and Q3.

Box plots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the concentration of the data. They also show how far the extreme values are from most of the data. A box plot is constructed from five values: the minimum value, the first quartile, the median, the third quartile, and the maximum value. We use these values to compare how close other data values are to them.

An outlier is a value that lies at the extremes of a data series. It is either very small or very large, and can therefore affect the overall observation made from the data series.


Rank = (n + 1) × P / 100

n = number of observations

P = corresponding percentile (for example, 25 for Q1)

Example of IQR

Question. Consider the data below and find out whether there are any outliers.

32, 980, 12567, 33000, 99000, 545, 1256, 9898, 12568, 32984

Answer:

Step 1: We arrange these observations in ascending order

Value    Rank
32       R1
545      R2
980      R3
1256     R4
9898     R5
12567    R6
12568    R7
32984    R8
33000    R9
99000    R10

Step 2:

We need to calculate Q1, Q2 and Q3

Q1 = 25th percentile

Q2 = 50th percentile

Q3 = 75th percentile

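Calculation of Q1, Q2 and Q3

With n = 10, Rank = (n + 1) × P / 100. A fractional rank is resolved by interpolating between the two neighbouring ordered values.

Q1: Rank = 11 × 25 / 100 = 2.75, so Q1 = R2 + 0.75 × (R3 - R2) = 545 + 0.75 × 435 = 871.25

Q2: Rank = 11 × 50 / 100 = 5.5, so Q2 = R5 + 0.5 × (R6 - R5) = 9898 + 0.5 × 2669 = 11232.5

Q3: Rank = 11 × 75 / 100 = 8.25, so Q3 = R8 + 0.25 × (R9 - R8) = 32984 + 0.25 × 16 = 32988

Step 3

Calculate IQR = Q3 - Q1

IQR = 32988 - 871.25

IQR = 32116.75

Step 4

Calculate Lower and Upper Boundary

For Lower Boundary

Lower boundary = Q1 - 1.5 * (IQR)

LB = 871.25 - 1.5(32116.75)

LB = -47303.875

For Upper Boundary

UB = Q3 + 1.5 * (IQR)

UB = 32988 + 1.5(32116.75)

UB = 81163.125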

Step 5

Let’s now represent it diagrammatically. As 99000 falls outside the upper boundary, it is an outlier.

Outlier = 99000
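
The same outlier check can be sketched with Python's standard `statistics` module. The intermediate quartile values depend on the interpolation convention used; this sketch relies on `quantiles(..., method="exclusive")`, which places the P-th percentile at rank (n + 1) × P / 100 and interpolates linearly between neighbouring values.

```python
from statistics import quantiles

data = [32, 980, 12567, 33000, 99000, 545, 1256, 9898, 12568, 32984]

# quantiles() sorts the data itself; n=4 asks for the three quartile cut points.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside [lower_bound, upper_bound] is flagged as an outlier.
outliers = [x for x in data if x < lower_bound or x > upper_bound]

print(outliers)    # [99000]
```

Other tools may default to a different quartile convention, so the boundaries can shift slightly from one tool to another even when the flagged outlier is the same.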

Skewness

It is the degree of distortion from the symmetrical bell curve, or the normal distribution. It measures the lack of symmetry in a data distribution. A symmetrical distribution will have a skewness of 0. When the skewness is 0, i.e. when the distribution is not skewed, the measure of central tendency used is the mean. Usually in this case the mean and median are equal.

[Figure: Symmetrical distribution]

Positive Skewness: The tail on the right side of the distribution is longer or fatter. In this case the mean is larger than the median.

Example: Distribution of Income. If the distribution of the household incomes of a region is studied, with values ranging from $5,000 to $250,000, most of the citizens fall in the group between $5,000 and $100,000, which forms the bulk of the distribution towards the left side, which is the lower side. However, a couple of individuals may have a very high income, in the millions. This makes the tail of extreme values (high income) extend longer towards the positive, or right, side. Thus, it is a positively skewed distribution.

[Figure: Positive skew]

Negative Skewness:

Negative skewness is when the tail on the left side of the distribution is longer or fatter than the tail on the right side. In this case the mean is smaller than the median. In both positively and negatively skewed cases the median is preferred over the mean.

Example: Retirement Age. When the retirement age of employees is compared, it is found that most retire in their mid-sixties or older. Thus, most of the distribution will be near the higher extreme, or the right side. However, there is a newer trend in which a few people retire early, some of them at very young ages. This makes the tail of the distribution longer towards the left, or lower, side, and these low values (low ages) shift the mean towards the left, making it a negatively skewed distribution.

[Figure: Negative skew]

● If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.

● If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the data are moderately skewed.

● If the skewness is less than -1 (negatively skewed) or greater than 1 (positively skewed), the data are highly skewed.
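
The rule of thumb above can be applied in code. The article does not give a formula for skewness, so this sketch assumes the common moment coefficient (third central moment divided by the second central moment raised to the power 1.5). The function names and sample values are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness; assumes the values are not all identical."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n    # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n    # third central moment
    return m3 / m2 ** 1.5

def describe_skew(g):
    """Map a skewness value onto the three bands listed above."""
    direction = "positively" if g > 0 else "negatively"
    if abs(g) <= 0.5:
        return "fairly symmetrical"
    if abs(g) <= 1:
        return f"moderately skewed ({direction})"
    return f"highly skewed ({direction})"

symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 1, 2, 10]    # one large value stretches the right tail

print(describe_skew(skewness(symmetric)))          # fairly symmetrical
print(describe_skew(skewness(long_right_tail)))    # highly skewed (positively)
```

Statistical packages often apply a small-sample correction to this coefficient, so their values can differ slightly from the ones computed here.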

Kurtosis

It is often described as the sharpness of the peak of a frequency-distribution curve. It is actually a measure of the outliers present in the distribution, that is, of how heavy the tails are.

High kurtosis in a data set is an indicator that the data has heavy outliers.

Low kurtosis in a data set is an indicator that the data lacks outliers.

Mesokurtic (Kurtosis = 3): This distribution has a kurtosis statistic similar to that of the normal distribution.

Leptokurtic (Kurtosis > 3): The peak is higher and sharper than Mesokurtic, which means that the data has heavy outliers.

Platykurtic (Kurtosis < 3): The peak is lower and broader than Mesokurtic, which means that the data has a lack of outliers.
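
As a rough illustration of the three categories, the sketch below assumes the plain moment definition of kurtosis (fourth central moment divided by the squared second central moment), for which a normal distribution scores 3. The function names and sample values are hypothetical.

```python
def kurtosis(values):
    """Moment kurtosis (normal distribution = 3); assumes non-constant values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n    # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n    # fourth central moment
    return m4 / m2 ** 2

def classify_kurtosis(k):
    """Label a kurtosis value using the threshold of 3 from the text."""
    if k > 3:
        return "leptokurtic"
    if k < 3:
        return "platykurtic"
    return "mesokurtic"

evenly_spread = [1, 2, 3, 4, 5]                       # no extreme values
heavy_tails = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]       # two extreme values

print(classify_kurtosis(kurtosis(evenly_spread)))    # platykurtic (about 1.7)
print(classify_kurtosis(kurtosis(heavy_tails)))      # leptokurtic (about 5.0)
```

Some libraries report excess kurtosis instead, which subtracts 3 so that the normal distribution scores 0; in that convention the thresholds above become 0 rather than 3.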

What are the advantages and limitations of the range as a measure of dispersion?

The range is the difference between the largest and the smallest observation in the data. The prime advantage of this measure of dispersion is that it is easy to calculate. On the other hand, it has several disadvantages. It is very sensitive to outliers and does not use all the observations in a data set.
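
Both points can be seen in a few lines of code. The values below are hypothetical and the helper name `value_range` is chosen for illustration.

```python
def value_range(values):
    """Range = Max - Min."""
    return max(values) - min(values)

scores = [48, 50, 51, 52, 53, 55]        # hypothetical values
with_outlier = scores + [120]            # the same data plus one extreme value
reshuffled = [48, 48, 48, 55, 55, 55]    # same extremes, different middle values

print(value_range(scores))          # 7
print(value_range(with_outlier))    # 72
print(value_range(reshuffled))      # 7
```

A single extreme value changes the range from 7 to 72, while rearranging every value between the two extremes leaves it unchanged.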

What is a limitation of the range as a measure of dispersion?

Limitations of using the range as a measure of spread or dispersion: The range is not based on all the observations of the series. It takes into account only the two most extreme scores. It lets us make only a rough comparison of the variability of two or more groups.

What are the limitations of the range?

Demerits or Limitations or Drawbacks:
  • Range is not based on all the terms. ...
  • For this reason, range is not a reliable measure of dispersion.
  • Range does not change in the least even if all the other, in-between terms are changed ...
  • Range is strongly affected by fluctuations of sampling.

What are the two advantages of range?

  • Range as a measure is easy to understand.
  • It is simple to calculate.
  • Range is widely used in some statistical series, such as changes in interest rates, changes in production, etc.