Statistics revision notes for JEE

Key Concepts & Definitions

Statistics is the science of averages and their estimates, dealing with data collected for specific purposes to make decisions through analysis and interpretation.

1. Measures of Central Tendency A measure of central tendency is a single representative value that gives a rough idea of where the data points are centered.

Mean (Arithmetic Mean): The sum of observations divided by the number of observations.
Median: The middlemost value when the data are arranged in ascending or descending order. If $n$ is odd, it is the $\left(\frac{n+1}{2}\right)^{th}$ observation. If $n$ is even, it is the mean of the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n}{2} + 1\right)^{th}$ observations.
Mode: The most frequently occurring observation.
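
As a quick check of the three definitions, the sketch below uses Python's standard `statistics` module on two small lists. The data values are made up for illustration only.

```python
from statistics import mean, median, multimode

odd_data = [7, 3, 9, 3, 5]         # n = 5 (odd), illustrative values
even_data = [7, 3, 9, 3, 5, 11]    # n = 6 (even), illustrative values

mean_odd = mean(odd_data)          # (7 + 3 + 9 + 3 + 5) / 5 = 5.4
median_odd = median(odd_data)      # sorted: 3, 3, 5, 7, 9 -> 3rd observation = 5
median_even = median(even_data)    # sorted: 3, 3, 5, 7, 9, 11 -> mean of 5 and 7 = 6
modes = multimode(odd_data)        # most frequent value(s): [3]

print(mean_odd, median_odd, median_even, modes)
```

`multimode` returns a list because a dataset can have more than one mode.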

2. Variability & Measures of Dispersion Central tendency alone is insufficient to give complete information about data. Variability describes how scattered or bunched the data is around the central measure. A single number describing this variability is called a measure of dispersion. Types of measures of dispersion:

Range: The difference between the maximum and minimum values in a series.
Quartile Deviation: Half the difference between the upper and lower quartiles (Semi-interquartile range).
Mean Deviation: The mean of the absolute values of the deviations of observations from a central value (mean or median).
Standard Deviation: The positive square root of the variance, acting as the most reliable indicator of dispersion.

3. Variance The mean of the squares of the deviations from the mean. It overcomes the mathematical difficulties of the Mean Deviation (which uses absolute values) by squaring the deviations to ensure non-negativity. $\rightarrow$ [JEE TIP] Variance is independent of the change of origin but depends on the change of scale.

4. Coefficient of Variation (CV) A dimensionless, relative measure of dispersion used to compare the variability of two or more distributions. $CV = \frac{\sigma}{\bar{x}} \times 100$ . A distribution with a lower CV is termed more consistent, uniform, or stable. $\rightarrow$ [JEE TIP] Always use CV, not standard deviation, to compare the consistency of two distinct datasets (e.g., comparing runs of two batsmen).
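
The batsmen comparison can be sketched as follows. The scores are hypothetical, chosen so that batsman B has the larger standard deviation but the smaller CV. The helper name `coefficient_of_variation` is an illustrative choice, and the population standard deviation (`pstdev`) is assumed, matching the $\frac{1}{n}$ convention used in these notes.

```python
from statistics import mean, pstdev

def coefficient_of_variation(data):
    """CV = (sigma / mean) * 100, with sigma the population standard deviation."""
    return pstdev(data) / mean(data) * 100

batsman_a = [40, 50, 60, 50, 50]       # hypothetical scores, mean 50
batsman_b = [88, 112, 100, 100, 100]   # hypothetical scores, mean 100

cv_a = coefficient_of_variation(batsman_a)
cv_b = coefficient_of_variation(batsman_b)

# Lower CV means more consistent, even though B has the larger sigma.
more_consistent = "A" if cv_a < cv_b else "B"
print(round(cv_a, 2), round(cv_b, 2), more_consistent)
```

Comparing the raw standard deviations here would pick A, which is exactly the mistake the tip warns against.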

5. Historical Context (Theoretical Statistics)

A.L. Bowley & A.L. Boddington: Defined statistics as the science of averages/estimates.
Kautilya (300 B.C.): Maintained vital statistics in Arthshastra.
Francis Galton: Pioneered biometry statistics.
Karl Pearson: Discovered the Chi-square test and founded the first statistical laboratory.
Sir Ronald A. Fisher: Father of modern statistics.

Formulae, Equations & Units

1. Mean ( $\bar{x}$ )

Ungrouped data: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ .
Discrete frequency distribution: $\bar{x} = \frac{1}{N} \sum_{i=1}^{n} f_i x_i$ , where $N = \sum f_i$ .
Step-deviation method (Shortcut): $\bar{x} = A + \frac{\sum f_i y_i}{N} \times h$ , where $A$ is the assumed mean, $h$ is the class width, and $y_i = \frac{x_i - A}{h}$ .
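
A minimal sketch of the step-deviation shortcut, checked against the direct formula. The class mid-points, frequencies, and the choice $A = 35$ , $h = 10$ are illustrative assumptions. Exact fractions are used so that the two results can be compared without rounding.

```python
from fractions import Fraction

def direct_mean(mids, freqs):
    """x_bar = (1/N) * sum(f_i * x_i)."""
    return Fraction(sum(f * x for x, f in zip(mids, freqs)), sum(freqs))

def step_deviation_mean(mids, freqs, A, h):
    """x_bar = A + (sum(f_i * y_i) / N) * h, with y_i = (x_i - A) / h."""
    N = sum(freqs)
    y = [Fraction(x - A, h) for x in mids]
    return A + sum(f * yi for f, yi in zip(freqs, y)) / N * h

mids = [15, 25, 35, 45, 55]    # class mid-points (illustrative)
freqs = [2, 3, 5, 6, 4]        # frequencies (illustrative), N = 20

shortcut = step_deviation_mean(mids, freqs, A=35, h=10)   # 35 + (7/20)*10 = 77/2
direct = direct_mean(mids, freqs)                         # 770/20 = 77/2
print(float(shortcut), float(direct))
```

Any choice of $A$ gives the same mean; picking a mid-point near the middle of the table simply keeps the $y_i$ small.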

2. Median (M) for Continuous Frequency Distribution

$M = l + \left( \frac{\frac{N}{2} - C}{f} \right) \times h$
- $l$ = lower limit of the median class (the first class whose cumulative frequency is $\geq N/2$ )
- $N$ = total frequency
- $C$ = cumulative frequency of the class preceding the median class
- $f$ = frequency of the median class
- $h$ = width of the median class
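
The formula can be sketched in code as below. The function name `grouped_median`, the equal class width, and the frequency table are assumptions made for illustration.

```python
def grouped_median(lower_limits, freqs, h):
    """Median of a continuous frequency distribution with equal class width h."""
    N = sum(freqs)
    if N == 0:
        raise ValueError("total frequency must be positive")
    half = N / 2
    cumulative = 0                       # C: cumulative frequency before the current class
    for l, f in zip(lower_limits, freqs):
        if cumulative + f >= half:       # first class whose cumulative frequency reaches N/2
            return l + (half - cumulative) / f * h
        cumulative += f

# Classes 0-10, 10-20, 20-30, 30-40, 40-50 (illustrative table)
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]                # N = 40, cumulative: 5, 13, 25, 35, 40

M = grouped_median(lower_limits, freqs, h=10)
# Median class is 20-30: l = 20, C = 13, f = 12, so M = 20 + (20 - 13)/12 * 10
print(round(M, 2))
```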

3. Mean Deviation (M.D.)

About a general value $a$ : $M.D.(a) = \frac{1}{n} \sum |x_i - a|$ .
About Mean (Ungrouped): $M.D.(\bar{x}) = \frac{1}{n} \sum |x_i - \bar{x}|$ .
About Median (Ungrouped): $M.D.(M) = \frac{1}{n} \sum |x_i - M|$ .
About Mean (Grouped): $M.D.(\bar{x}) = \frac{1}{N} \sum f_i |x_i - \bar{x}|$ .
About Median (Grouped): $M.D.(M) = \frac{1}{N} \sum f_i |x_i - M|$ .
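
A short sketch of the ungrouped and grouped mean-deviation formulas. The two helper functions and both datasets are illustrative, not part of any standard library.

```python
from statistics import mean, median

def mean_deviation(data, about):
    """M.D.(a) = (1/n) * sum |x_i - a| for ungrouped data."""
    return sum(abs(x - about) for x in data) / len(data)

def grouped_mean_deviation(values, freqs, about):
    """M.D.(a) = (1/N) * sum f_i |x_i - a| for a frequency distribution."""
    return sum(f * abs(x - about) for x, f in zip(values, freqs)) / sum(freqs)

data = [4, 7, 8, 9, 10, 12, 13, 17]                  # mean 10, median 9.5
md_mean = mean_deviation(data, mean(data))           # 24 / 8 = 3.0
md_median = mean_deviation(data, median(data))       # 24 / 8 = 3.0

values = [2, 5, 6, 8, 10, 12]
freqs = [2, 8, 10, 7, 8, 5]                          # N = 40
x_bar = sum(f * x for x, f in zip(values, freqs)) / sum(freqs)   # 300 / 40 = 7.5
md_grouped = grouped_mean_deviation(values, freqs, x_bar)        # 92 / 40 = 2.3
print(md_mean, md_median, md_grouped)
```

The two ungrouped results happen to coincide for this particular dataset; in general M.D. about the median is less than or equal to M.D. about the mean.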

4. Variance ( $\sigma^2$ )

Ungrouped data (Basic definition): $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$ .
Ungrouped data (Calculation shortcut): $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - (\bar{x})^2$ . $\rightarrow$ [JEE TIP] Always use this form in JEE calculations to save time.
Discrete/Continuous Grouped data: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{n} f_i (x_i - \bar{x})^2$ .
Grouped data (Calculation shortcut): $\sigma^2 = \frac{1}{N^2} \left[ N \sum f_i x_i^2 - \left(\sum f_i x_i\right)^2 \right]$ .
Step-deviation method: $\sigma_x^2 = h^2 \sigma_y^2 = \frac{h^2}{N^2} \left[ N \sum f_i y_i^2 - \left(\sum f_i y_i\right)^2 \right]$ .
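
The three grouped-data forms above should give the same value. The sketch below checks this with exact fractions on an illustrative table; the data and the choice $A = 35$ , $h = 10$ are assumptions.

```python
from fractions import Fraction

mids = [15, 25, 35, 45, 55]    # class mid-points x_i (illustrative)
freqs = [2, 3, 5, 6, 4]        # frequencies f_i (illustrative)
N = sum(freqs)                 # 20

sum_fx = sum(f * x for x, f in zip(mids, freqs))        # 770
sum_fx2 = sum(f * x * x for x, f in zip(mids, freqs))   # 32700
x_bar = Fraction(sum_fx, N)                             # 77/2

# Basic definition: (1/N) * sum f_i (x_i - x_bar)^2
var_definition = sum(f * (x - x_bar) ** 2 for x, f in zip(mids, freqs)) / N

# Shortcut: (1/N^2) * [N * sum f_i x_i^2 - (sum f_i x_i)^2]
var_shortcut = Fraction(N * sum_fx2 - sum_fx ** 2, N ** 2)

# Step-deviation: h^2 * sigma_y^2 with y_i = (x_i - A) / h
A, h = 35, 10
y = [Fraction(x - A, h) for x in mids]
sum_fy = sum(f * yi for f, yi in zip(freqs, y))
sum_fy2 = sum(f * yi ** 2 for f, yi in zip(freqs, y))
var_step = h ** 2 * (N * sum_fy2 - sum_fy ** 2) / N ** 2

print(var_definition, var_shortcut, var_step)   # each equals 611/4 = 152.75
```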

5. Standard Deviation ( $\sigma$ )

It is simply the positive square root of variance: $\sigma = \sqrt{\sigma^2}$ .

6. Combined Mean and Combined Variance (JEE Advanced Focus) If two groups of sizes $n_1$ and $n_2$ have means $\bar{x}_1, \bar{x}_2$ and variances $\sigma_1^2, \sigma_2^2$ :

Combined Mean: $\bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
Combined Variance: $\sigma_c^2 = \frac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)}{n_1 + n_2}$ , where $d_1 = \bar{x}_1 - \bar{x}_c$ and $d_2 = \bar{x}_2 - \bar{x}_c$ .
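
As a sanity check, the combined formulas can be compared with the mean and variance of the pooled data. The two groups and the helper names `mean_var` and `combine` are illustrative assumptions.

```python
from fractions import Fraction

def mean_var(data):
    """Exact mean and population variance of a list of numbers."""
    n = len(data)
    m = Fraction(sum(data), n)
    return m, sum((x - m) ** 2 for x in data) / n

def combine(n1, m1, v1, n2, m2, v2):
    """Combined mean and variance from group sizes, means, and variances."""
    mc = (n1 * m1 + n2 * m2) / (n1 + n2)
    d1, d2 = m1 - mc, m2 - mc
    vc = (n1 * (v1 + d1 ** 2) + n2 * (v2 + d2 ** 2)) / (n1 + n2)
    return mc, vc

group1 = [2, 4, 6]             # mean 4, variance 8/3
group2 = [10, 12, 14, 16]      # mean 13, variance 5

m1, v1 = mean_var(group1)
m2, v2 = mean_var(group2)
combined = combine(len(group1), m1, v1, len(group2), m2, v2)
pooled = mean_var(group1 + group2)   # direct computation on all 7 values

weighted_avg_var = (len(group1) * v1 + len(group2) * v2) / 7
print(combined, pooled, weighted_avg_var)
```

The weighted average of the two variances comes out smaller than the true combined variance because it ignores the gap between the group means.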

Conditions & Limitations

Mean Deviation: Cannot be subjected to further algebraic/calculus treatment because it involves absolute values (modulus function is not differentiable everywhere).
Mean Deviation about Median: If the degree of variability in a series is very high, the median is not a representative measure of central tendency. Thus, M.D. about the median cannot be fully relied upon for highly dispersed data.
Sum of Deviations: The sum of the deviations from the mean, taken with their signs (without absolute values), is always zero ( $\sum (x_i - \bar{x}) = 0$ ). Hence the simple average of the deviations is useless as a measure of dispersion.
Continuous Distributions: The grouped-data formulas assume that the frequency of each class is concentrated at its mid-point ( $x_i$ ). This is an approximation.

⚠️ COMMON MISCONCEPTIONS & SIGN CONVENTIONS

Trap of Scale vs. Origin:
- Adding or subtracting a constant $a$ from every observation (change of origin) changes the Mean ( $\bar{y} = \bar{x} \pm a$ ) but DOES NOT change Variance or Standard Deviation ( $\sigma_y = \sigma_x$ ).
- Multiplying or dividing every observation by a constant $k$ (change of scale) changes the Mean ( $\bar{y} = k\bar{x}$ ), changes Variance by $k^2$ ( $\sigma_y^2 = k^2 \sigma_x^2$ ), and changes Standard Deviation by $|k|$ ( $\sigma_y = |k|\sigma_x$ ).
- Linear transformation $y_i = ax_i + b$ : $\bar{y} = a\bar{x} + b$ , and $Var(y) = a^2 Var(x)$ . $\rightarrow$ [JEE TIP] Note that $b$ vanishes completely for variance.
Sum of Squares Trap: The term $\sum x_i^2$ is the sum of the squares of the observations. This is strictly distinct from $(\sum x_i)^2$ , which is the square of the sum. Do not confuse them in the variance formula $\frac{\sum x_i^2}{n} - (\frac{\sum x_i}{n})^2$ .
Modulus in Mean Deviation: Many students forget to take absolute values when calculating Mean Deviation. M.D. requires $|x_i - \bar{x}|$ , not the signed deviation $(x_i - \bar{x})$ .
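
The origin and scale rules can be verified numerically. The sketch below applies $y_i = ax_i + b$ with a negative $a$ to a small dataset; the data and the values $a = -3$ , $b = 7$ are illustrative assumptions.

```python
from statistics import mean, pstdev, pvariance

x = [2, 4, 4, 4, 5, 5, 7, 9]     # illustrative data: mean 5, variance 4, SD 2
a, b = -3, 7                     # hypothetical scale factor and shift
y = [a * xi + b for xi in x]

mean_ok = mean(y) == a * mean(x) + b                    # -8 == -3*5 + 7
var_ok = pvariance(y) == a ** 2 * pvariance(x)          # 36 == 9 * 4, b plays no role
sd_ok = abs(pstdev(y) - abs(a) * pstdev(x)) < 1e-9      # 6 == |-3| * 2

print(mean_ok, var_ok, sd_ok)
```

Note that the standard deviation uses $|a|$ , so the negative sign of $a$ affects the mean but not the spread.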

Previous Year JEE Topics

Correcting Incorrect Observations: Given an incorrect mean and variance due to misread data points, calculating the exact correct mean and variance.
Linear Transformations: If a dataset undergoes operations like $x_{new} = \alpha x_{old} + \beta$ , computing the new variance/SD.
Variance of Natural Numbers: The variance of the first $n$ natural numbers, and of terms in arithmetic or geometric progression.
Combined Variance: Questions providing data for boys vs. girls in a class and asking for the aggregate variance.

Standard Derivations & Step-by-Step Problem Solving

Correcting Incorrect Observations (Frequent JEE Profile) Scenario: A student calculates the mean and variance of $n$ observations but later discovers that one observation $x_{wrong}$ was recorded incorrectly and should be $x_{correct}$ .

Step 1: Calculate Incorrect Sum of observations: $\sum x_{inc} = n \times \bar{x}_{inc}$
Step 2: Calculate Correct Sum: $\sum x_{cor} = \sum x_{inc} - x_{wrong} + x_{correct}$
Step 3: Calculate Correct Mean: $\bar{x}_{cor} = \frac{\sum x_{cor}}{n}$
Step 4: Use the variance formula to find Incorrect Sum of Squares: $\sigma_{inc}^2 = \frac{\sum x_{inc}^2}{n} - (\bar{x}_{inc})^2 \implies \sum x_{inc}^2 = n(\sigma_{inc}^2 + \bar{x}_{inc}^2)$
Step 5: Calculate Correct Sum of Squares: $\sum x_{cor}^2 = \sum x_{inc}^2 - (x_{wrong})^2 + (x_{correct})^2$
Step 6: Calculate Correct Variance: $\sigma_{cor}^2 = \frac{\sum x_{cor}^2}{n} - (\bar{x}_{cor})^2$
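
The six steps can be sketched as a single function. The name `correct_mean_variance` and the numbers in the usage example (100 observations, mean 40, SD 5.1, with 50 misread as 40) are illustrative assumptions. Exact fractions avoid rounding noise.

```python
from fractions import Fraction

def correct_mean_variance(n, mean_inc, var_inc, x_wrong, x_correct):
    """Return (correct mean, correct variance) after fixing one misread value."""
    sum_inc = n * mean_inc                                  # Step 1
    sum_cor = sum_inc - x_wrong + x_correct                 # Step 2
    mean_cor = sum_cor / n                                  # Step 3
    sumsq_inc = n * (var_inc + mean_inc ** 2)               # Step 4
    sumsq_cor = sumsq_inc - x_wrong ** 2 + x_correct ** 2   # Step 5
    var_cor = sumsq_cor / n - mean_cor ** 2                 # Step 6
    return mean_cor, var_cor

mean_cor, var_cor = correct_mean_variance(
    n=100,
    mean_inc=Fraction(40),
    var_inc=Fraction(51, 10) ** 2,   # SD 5.1, so variance 26.01
    x_wrong=40,
    x_correct=50,
)
print(float(mean_cor), float(var_cor))   # 40.1 and 27.0
```

If several observations were misread, the same idea applies: subtract every wrong value (and its square) and add every correct one in Steps 2 and 5.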

Variance of First $n$ Natural Numbers Let $x_i = i$ for $i = 1, 2, \ldots, n$ .

Mean: $\bar{x} = \frac{1}{n} \sum i = \frac{n(n+1)}{2n} = \frac{n+1}{2}$ .
Sum of squares: $\sum x_i^2 = \frac{n(n+1)(2n+1)}{6}$ .
Variance: $\sigma^2 = \frac{\sum x_i^2}{n} - (\bar{x})^2 = \frac{(n+1)(2n+1)}{6} - \left(\frac{n+1}{2}\right)^2 = \frac{(n+1)\left[2(2n+1) - 3(n+1)\right]}{12} = \frac{(n+1)(n-1)}{12} = \frac{n^2 - 1}{12}$ .
$\rightarrow$ [JEE TIP] Memorize $\sigma^2 = \frac{n^2 - 1}{12}$ for first $n$ natural numbers.
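
A brute-force check of $\sigma^2 = \frac{n^2 - 1}{12}$ is sketched below. It also checks the arithmetic-progression case: an A.P. with common difference $d$ is a linear transform of $1, 2, \ldots, n$ , so its variance is $d^2 \cdot \frac{n^2 - 1}{12}$ . The helper name `pop_variance` and the sample A.P. are illustrative assumptions.

```python
from fractions import Fraction

def pop_variance(xs):
    """Population variance via (sum of squares)/n - (mean)^2, computed exactly."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    return Fraction(sum(x * x for x in xs), n) - mean ** 2

# Compare the closed form with direct computation for n = 1 .. 50.
formula_holds = all(
    pop_variance(list(range(1, n + 1))) == Fraction(n * n - 1, 12)
    for n in range(1, 51)
)

# A.P. with first term 5, common difference 3, 10 terms (illustrative).
first, d, n_terms = 5, 3, 10
ap = [first + i * d for i in range(n_terms)]
ap_variance = pop_variance(ap)     # d^2 (n^2 - 1)/12 = 9 * 99 / 12 = 297/4

print(formula_holds, ap_variance)
```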

Memory Aids & JEE Traps

$\rightarrow$ [JEE TIP] Trap 1 - "Variance of a constant sequence": If all observations are equal to $k$ , then variance is strictly $0$ (since $x_i - \bar{x} = 0$ ).
$\rightarrow$ [JEE TIP] Trap 2 - "Addition vs Multiplication effect on Variance": If $y_i = \frac{x_i - A}{h}$ , do not forget that $Var(x) = h^2 Var(y)$ . The constant $A$ vanishes, but the scaling factor $h$ is squared.
$\rightarrow$ [JEE TIP] Trap 3 - "Minimum sum of squared deviations": The sum of squared deviations $\sum (x_i - a)^2$ is mathematically minimized when $a$ is the Mean ( $\bar{x}$ ). If asked to minimize this sum, set $a = \bar{x}$ .
$\rightarrow$ [JEE TIP] Trap 4 - "Minimum sum of absolute deviations": The sum of absolute deviations $\sum |x_i - a|$ is mathematically minimized when $a$ is the Median.
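
Traps 3 and 4 can be illustrated with a small numerical search. The dataset and the grid of candidate values are assumptions chosen so that the mean (4) and the median (3) differ. This is a demonstration on one example, not a proof.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 10]                        # illustrative: mean 4, median 3
candidates = [k / 2 for k in range(0, 21)]     # a = 0.0, 0.5, ..., 10.0

def sum_squared(a):
    return sum((x - a) ** 2 for x in data)

def sum_absolute(a):
    return sum(abs(x - a) for x in data)

best_for_squares = min(candidates, key=sum_squared)     # expected: the mean
best_for_absolute = min(candidates, key=sum_absolute)   # expected: the median

print(best_for_squares, mean(data))      # 4.0 4
print(best_for_absolute, median(data))   # 3.0 3
```

With an even number of observations, every value between the two middle observations minimizes the sum of absolute deviations, so the minimizer is not unique.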

Top 10 MCQ Traps

[JEE TIP] Trap 1 - The Origin Shift Variance Inertia:
- Misconception: Adding or subtracting a uniform constant from every data point in a distribution shifts and scales its variance.
- Correct Understanding: Measures of dispersion (Variance, Standard Deviation, and Range) are completely independent of a change of origin. Adding a constant shifts the entire distribution uniformly without altering the spread between points, meaning the variance remains absolutely unchanged. Only a change of scale (multiplication/division) affects dispersion.
[JEE TIP] Trap 2 - The Zero Mean Deviation Nullification:
- Misconception: Calculating the Mean Deviation (M.D.) means taking the simple arithmetic average of the signed deviations: $\frac{\sum(x_i - \bar{x})}{n}$ .
- Correct Understanding: Mean Deviation requires absolute values: $\frac{\sum|x_i - \bar{x}|}{n}$ . If you omit the modulus, the positive and negative deviations about the arithmetic mean cancel exactly, so the sum is zero for every dataset.
[JEE TIP] Trap 3 - Negative Standard Deviation Scaling:
- Misconception: Multiplying every data point in a set by a negative scaling factor like $-3$ scales the Standard Deviation ( $\sigma$ ) by that same factor of $-3$ .
- Correct Understanding: Standard Deviation is the positive square root of the variance, so it can never be negative. Scaling the data by a factor $k$ multiplies the standard deviation by $|k|$ : $\sigma_{\text{new}} = |k|\sigma_{\text{old}}$ . Multiplying by $-3$ therefore multiplies the standard deviation by $|-3| = 3$ .
[JEE TIP] Trap 4 - The Combined Variance Mean Shift Deficit:
- Misconception: When merging two distinct datasets together, the combined variance of the total pool is simply the weighted average of their individual variances.
- Correct Understanding: The weighted average calculation completely misses the dispersion caused by the distance between the two distinct group means. The correct combined variance formula must explicitly incorporate mean displacement factors ( $d_1$ and $d_2$ ): $\sigma_c^2 = \frac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)}{n_1 + n_2}$ , where $d_1 = \bar{x}_1 - \bar{x}_c$ and $d_2 = \bar{x}_2 - \bar{x}_c$ .
[JEE TIP] Trap 5 - Absolute vs. Relative Dispersion Comparisons:
- Misconception: A dataset possessing a larger absolute Standard Deviation is automatically more variable or less stable than another dataset.
- Correct Understanding: You cannot directly compare the standard deviations of datasets that have radically different units or completely different means. To perform a valid relative variability comparison, you must evaluate the Coefficient of Variation (C.V.), which normalizes the metric: $\text{C.V.} = \frac{\sigma}{\bar{x}} \times 100$ . The dataset with the higher C.V. percentage is the one that is truly more variable.
[JEE TIP] Trap 6 - Variance Dimensional Unit Inflation:
- Misconception: Variance is measured and expressed in the exact same dimensional units (e.g., kg, meters, seconds) as the raw observations.
- Correct Understanding: Because the variance formula squares the deviations, its value is expressed in squared units (e.g., $\text{kg}^2$ , $\text{m}^2$ ). The Standard Deviation (which takes the square root) and the Mean Deviation (which averages absolute deviations) are expressed in the original units of the data.
[JEE TIP] Trap 7 - The Sum-Square Operator Algebraic Swap:
- Misconception: The notation representing the sum of squares ( $\sum x_i^2$ ) and the notation for the square of the sum ( $(\sum x_i)^2$ ) are algebraically interchangeable.
- Correct Understanding: These are fundamentally distinct operations. $\sum x_i^2$ squares each element individually before adding them, whereas $(\sum x_i)^2$ sums all elements first before squaring the total pool. Mixing them up will completely corrupt the standard variance computational formula: $\sigma^2 = \frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2$ .
[JEE TIP] Trap 8 - The Median Class Midpoint Fallacy:
- Misconception: In a continuous frequency distribution table, the median value can be estimated by choosing the exact midpoint of the designated median class.
- Correct Understanding: The midpoint of a class is used for evaluating the mean of grouped data, not the median. Finding the median of a continuous frequency distribution requires linear interpolation within the median class, using the formula: $\text{Median} = l + \left(\frac{\frac{N}{2} - C}{f}\right) \times h$ .
[JEE TIP] Trap 9 - The Sum of Signed Deviations Illusion:
- Misconception: Summing up the raw differences between each individual observation and the mean ( $\sum(x_i - \bar{x})$ ) serves as a valid approach to measure the total dispersion of a dataset.
- Correct Understanding: This sum provides no information about dispersion because the algebraic sum of deviations from the arithmetic mean is universally equal to zero. The negative values below the mean always cancel out the positive values above it. To measure dispersion, you must use absolute values or square the differences.
[JEE TIP] Trap 10 - Linear Variance Scaling Delusion:
- Misconception: If every observation in a dataset is divided by a factor of $2$ , the original variance of the system drops linearly by half (e.g., a variance of $5$ becomes $2.5$ ).
- Correct Understanding: Variance scales quadratically with respect to multiplication or division. If every observation is scaled by a factor of $\frac{1}{k}$ , the variance scales by a factor of $\frac{1}{k^2}$ . Therefore, dividing every data point by $2$ alters the variance by a factor of $\frac{1}{2^2} = \frac{1}{4}$ , transforming an initial variance of $5$ into exactly $\frac{5}{4} = 1.25$ .