Statistical Techniques for Machine Learning | AIML 3rd Semester Course Guide
Statistical Modelling for Machine Learning (313307)
Unit – I Statistical Techniques
1.1 Frequency Distribution
Definition: A frequency distribution is a table that shows how often each value, or each range of values, occurs in a data set.
Basic Terms:
- Class Interval: A range within which data points are grouped.
- Frequency: The number of times a data point or range of data points occurs.
- Relative Frequency: The fraction or percentage of times a data point occurs in the data set.
- Cumulative Frequency: The sum of frequencies up to a certain class interval.
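The terms above can be illustrated with a minimal Python sketch. The helper name `frequency_table` and the sample values are assumptions made for illustration; they do not come from the course text.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(values):
    """Return rows of (value, frequency, relative frequency, cumulative frequency)."""
    counts = Counter(values)
    distinct = sorted(counts)                  # distinct values in ascending order
    freqs = [counts[v] for v in distinct]      # frequency of each distinct value
    total = len(values)
    cumulative = list(accumulate(freqs))       # running total of the frequencies
    return [(v, f, f / total, c) for v, f, c in zip(distinct, freqs, cumulative)]

sample = [1, 2, 2, 3, 3, 3, 4, 4]  # hypothetical data
for row in frequency_table(sample):
    print(row)
```

The last cumulative frequency always equals the number of observations, which is a quick check on any hand-built table.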
1.2 Classification of Data
- Raw Data: Unprocessed data collected from observations.
- Ungrouped Data: Data listed as individual values, without being organized into classes.
- Grouped Data: Data organized into intervals or classes, often represented in frequency tables.
1.3 Measures of Central Tendency
- Mean: The average of a set of values, calculated as the sum of all values divided by the number of values.
- Median: The middle value when the data is arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values.
- Mode: The value that appears most frequently in a data set. There can be more than one mode or no mode at all.
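As a rough illustration of the three definitions, the sketch below uses Python's standard `statistics` module. The two sample lists are hypothetical and were chosen to show the even-count median rule and a data set with two modes.

```python
import statistics

even_sample = [3, 1, 4, 2]         # hypothetical; four values, so two middle values
bimodal_sample = [1, 1, 2, 3, 3]   # hypothetical; 1 and 3 tie for most frequent

mean_value = statistics.mean(even_sample)       # (3 + 1 + 4 + 2) / 4
median_value = statistics.median(even_sample)   # average of the middle values 2 and 3
modes = statistics.multimode(bimodal_sample)    # every value with the highest count

print(mean_value, median_value, modes)
```

`statistics.multimode` returns a list, so it handles the "more than one mode" case directly.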
1.4 Concept of Quartiles, Deciles, and Percentiles
- Quartiles: Divide the data into four equal parts. Q1 (25th percentile), Q2 (50th percentile, also the median), and Q3 (75th percentile).
- Deciles: Divide the data into ten equal parts. Each decile represents 10% of the data.
- Percentiles: Divide the data into 100 equal parts, with each percentile representing 1% of the data.
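Cut points of this kind can be computed with `statistics.quantiles`, as in the sketch below. The data are hypothetical. Note that several interpolation conventions exist: the `"exclusive"` method corresponds to the common textbook position formula based on n + 1, while `"inclusive"` treats the data as the whole population, so the two can give different quartiles for the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data, already sorted

# n=4 gives the three quartile cut points Q1, Q2, Q3.
quartiles_exclusive = statistics.quantiles(data, n=4, method="exclusive")
quartiles_inclusive = statistics.quantiles(data, n=4, method="inclusive")

# n=10 gives nine decile cut points; n=100 gives ninety-nine percentile cut points.
deciles = statistics.quantiles(data, n=10)
percentiles = statistics.quantiles(data, n=100)

print(quartiles_exclusive)  # [2.5, 5.0, 7.5]
print(quartiles_inclusive)  # [3.0, 5.0, 7.0]
```

In both conventions Q2 equals the median of the data.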
1.5 Geometric Mean, Harmonic Mean, and Combined Mean
- Geometric Mean: The nth root of the product of n positive values. Useful for data that is multiplicative in nature, such as growth rates.
- Harmonic Mean: The reciprocal of the arithmetic mean of the reciprocals of the data values. Suitable for rates and ratios.
- Combined Mean: The weighted mean of two or more groups with different sample sizes.
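A short sketch of the three means follows. `statistics.geometric_mean` and `statistics.harmonic_mean` are standard-library functions; `combined_mean` is a hypothetical helper written for this illustration, and all numbers are made up.

```python
import statistics

def combined_mean(groups):
    """groups: list of (size, mean) pairs; returns the size-weighted mean."""
    total_size = sum(size for size, _ in groups)
    return sum(size * mean for size, mean in groups) / total_size

gm = statistics.geometric_mean([2, 8])         # square root of 2 * 8, about 4
hm = statistics.harmonic_mean([40, 60])        # 2 / (1/40 + 1/60), about 48
cm = combined_mean([(10, 50.0), (30, 70.0)])   # (10*50 + 30*70) / 40

print(gm, hm, cm)
```

The harmonic mean example mirrors the classic average-speed problem: equal distances at 40 and 60 units per hour give an average of 48, not 50.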
1.6 Graphical Representation to Find Mode and Median
- Histogram: A graph of adjacent bars whose heights show the frequency of each class. The tallest bar marks the modal class, within which the mode is located.
- Ogive Curve: A cumulative frequency graph. The median is read off as the value on the horizontal axis at which the curve reaches half of the total frequency.
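The graphical readings can be approximated numerically. The sketch below assumes a "less than" ogive with straight segments between class boundaries and a histogram with equal class widths; the function names and the grouped data are hypothetical. The mode formula is the usual algebraic equivalent of drawing crossing diagonals on the tallest bar.

```python
def median_from_ogive(boundaries, frequencies):
    """Interpolate the value at which the cumulative frequency reaches N/2.

    boundaries has one more entry than frequencies (the class edges).
    """
    half = sum(frequencies) / 2
    cumulative = 0
    for lower, upper, freq in zip(boundaries, boundaries[1:], frequencies):
        if freq > 0 and cumulative + freq >= half:
            # Linear interpolation inside the median class.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("frequencies must contain at least one positive count")

def mode_from_histogram(boundaries, frequencies):
    """Estimate the mode inside the tallest bar (assumes equal class widths)."""
    i = max(range(len(frequencies)), key=frequencies.__getitem__)
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0                      # bar to the left
    f2 = frequencies[i + 1] if i + 1 < len(frequencies) else 0   # bar to the right
    lower, width = boundaries[i], boundaries[i + 1] - boundaries[i]
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        return lower + width / 2   # flat histogram: fall back to the class midpoint
    return lower + (f1 - f0) / denominator * width

edges = [0, 10, 20, 30, 40]   # hypothetical class boundaries
counts = [5, 15, 20, 10]      # hypothetical class frequencies
print(median_from_ogive(edges, counts))
print(mode_from_histogram(edges, counts))
```

Both estimates fall in the 20 to 30 class here, which is the tallest bar and also the class in which the ogive crosses half of the total frequency.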
1.7 Measures of Dispersion
- Range: Difference between the maximum and minimum values.
- Mean Deviation: Average of the absolute deviations from the mean.
- Standard Deviation: Square root of the variance. It expresses the typical distance of data points from the mean, in the same units as the data.
- Variance: The average of the squared deviations from the mean.
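The four measures can be computed as in the sketch below, using hypothetical data. Because variance is defined above as the average of the squared deviations, the population functions `statistics.pvariance` and `statistics.pstdev` are used; `statistics.variance` and `statistics.stdev` divide by n - 1 instead and would give slightly larger values.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data

data_range = max(data) - min(data)                                # 9 - 2
mean = statistics.mean(data)                                      # 40 / 8
mean_deviation = sum(abs(x - mean) for x in data) / len(data)     # 12 / 8
variance = statistics.pvariance(data)                             # 32 / 8
std_dev = statistics.pstdev(data)                                 # square root of the variance

print(data_range, mean_deviation, variance, std_dev)
```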
1.8 Skewness
- Types of Skewness:
  - Positive Skew: Tail on the right side.
  - Negative Skew: Tail on the left side.
- Test of Skewness: Methods to measure skewness using formulas.
  - Karl Pearson’s Coefficient of Skewness: \(\frac{3(\bar{X} - \text{Median})}{\sigma}\)
  - Bowley’s Coefficient of Skewness: \(\frac{Q3 + Q1 - 2 \times \text{Median}}{Q3 - Q1}\)
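Both coefficients translate directly into code. The sketch below is one possible reading of the formulas: it assumes the population standard deviation for σ and the `"inclusive"` quartile method, and the function names and data are hypothetical. Other quartile conventions would change Bowley's value slightly.

```python
import statistics

def pearson_skewness(data):
    """Karl Pearson's coefficient: 3 * (mean - median) / sigma."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sigma = statistics.pstdev(data)   # population standard deviation (an assumption)
    return 3 * (mean - median) / sigma

def bowley_skewness(data):
    """Bowley's coefficient: (Q3 + Q1 - 2 * median) / (Q3 - Q1)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (q3 + q1 - 2 * q2) / (q3 - q1)

right_tailed = [1, 2, 3, 4, 5, 8, 12, 20, 30]   # hypothetical, long right tail
symmetric = [1, 2, 3, 4, 5]                     # hypothetical, symmetric

print(pearson_skewness(right_tailed), bowley_skewness(right_tailed))
print(pearson_skewness(symmetric), bowley_skewness(symmetric))
```

Positive results indicate a right tail, negative results a left tail, and values near zero an approximately symmetric distribution.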
1.9 Types of Skewness in Terms of Mean and Mode
- Right Skew (Positive): Typically Mean > Median > Mode
- Left Skew (Negative): Typically Mean < Median < Mode
1.10 Measures of Kurtosis
- Kurtosis: Measures the “tailedness” of the data distribution.
- Excess Kurtosis: \(\frac{\text{Fourth Central Moment}}{(\text{Variance})^2} - 3\)
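The excess kurtosis formula can be sketched as follows, using population moments (division by n). The function name and sample data are hypothetical; statistical libraries often apply small-sample corrections, so their results may differ from this plain version.

```python
def excess_kurtosis(data):
    """Fourth central moment divided by the squared variance, minus 3."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # variance (second central moment)
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat_sample = [1, 2, 3, 4, 5]   # hypothetical, evenly spread values
print(excess_kurtosis(flat_sample))
```

A normal distribution has excess kurtosis 0. The evenly spread sample above gives a negative value, reflecting lighter tails than a normal distribution.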
Theory Learning Outcomes (TLOs)
TLO 1.1 Solve Problems Based on Frequency Distribution
Example: Consider the following data: 3, 7, 5, 7, 2, 6, 4, 7, 3, 8. Construct a frequency distribution table.
Class Interval | Frequency | Cumulative Frequency |
---|---|---|
2-4 | 4 | 4 |
5-7 | 5 | 9 |
8-10 | 1 | 10 |
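One way to check tallies like these is to count the values class by class. The sketch below uses the ten values from the example and treats the class limits as inclusive on both ends, which is an assumption that fits integer data; the variable names are illustrative.

```python
from itertools import accumulate

data = [3, 7, 5, 7, 2, 6, 4, 7, 3, 8]
classes = [(2, 4), (5, 7), (8, 10)]   # inclusive lower and upper class limits

# Count how many values fall inside each class, then build the running total.
frequencies = [sum(low <= x <= high for x in data) for low, high in classes]
cumulative = list(accumulate(frequencies))

for (low, high), freq, cum in zip(classes, frequencies, cumulative):
    print(f"{low}-{high} | {freq} | {cum}")
```

The final cumulative frequency should equal the number of observations, here 10.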
TLO 1.2 Calculate Mean, Median, and Mode for All Types of Data
Example: For the data set: 2, 4, 4, 4, 5, 7, 9
- Mean: \(\frac{2 + 4 + 4 + 4 + 5 + 7 + 9}{7} = 5\)
- Median: Middle value is 4 (since data is ordered and n = 7, which is odd)
- Mode: 4 (most frequent value)
TLO 1.3 Find Mode and Median Using Graphical Method
Example: Construct a histogram and ogive for the data set: 2, 3, 5, 7, 7, 8, 9.
- Mode: Identified as the highest bar in the histogram (value 7).