Monday, March 25, 2019

Statistics 101

What is Statistics
A branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation.

Probability is a mathematical language to discuss uncertainties and it plays a key role in statistics.
In layman's terms, statistics is a toolbox of methods for getting answers from data.
Terms
Binomial random variable
  • A random variable from an experiment whose trials each have only two possible outcomes; it counts the number of successes in a fixed number of independent trials with the same success probability.
Chi-squared distribution with k degrees of freedom
  • The distribution of a sum of the squares of k independent standard normal random variables.
  • It is a special case of the gamma distribution.
  • It is one of the most commonly used probability distributions in inferential statistics.
Chi-square analysis
  • A χ² test is a hypothesis test where the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true.
  • In simple terms, it often means Pearson's chi-squared test.
  • Used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
  • The test statistic is often constructed from a sum of squared errors or from the sample variance.
  • Test statistics that follow a chi-squared distribution arise from an assumption of independent, normally distributed data, which is valid in many cases due to the central limit theorem.
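
As a small illustration of comparing observed and expected frequencies, here is a minimal sketch of the Pearson chi-squared statistic in Python. The function name `pearson_chi_squared` and the die-roll counts are made up for this example; they are not from any particular library or data set.

```python
def pearson_chi_squared(observed, expected):
    """Return the Pearson statistic: the sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, so a fair die predicts 10 per face.
observed = [8, 12, 9, 11, 6, 14]
expected = [10] * 6

statistic = pearson_chi_squared(observed, expected)
degrees_of_freedom = len(observed) - 1  # number of categories minus one

print(f"chi-squared = {statistic:.2f} with {degrees_of_freedom} degrees of freedom")
```

To finish the test, the statistic is compared against a chi-squared critical value for the same degrees of freedom. For 5 degrees of freedom at the 0.05 level, the usual table value is about 11.07, so these made-up counts would not lead to rejecting the hypothesis of a fair die.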
Confidence interval
  • Estimates a parameter of a population using a sample.
  • Use the sample mean x̄ to find a range of values that we can be confident contains the mean of the sampled population.
  • Lower bound = estimate - margin of error
  • Upper bound = estimate + margin of error
  • T-intervals - use when the population standard deviation is unknown and either the original population is normal or the sample size is >= 30. This formula uses the sample standard deviation instead of the population standard deviation.
  • Z-intervals - use when the population standard deviation is known and either the sample size is >= 30 or the original population is normal.
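
The "estimate ± margin of error" recipe can be sketched for a Z-interval using only the Python standard library. This is a minimal sketch: the helper name `z_interval`, the sample values, and the "known" population standard deviation of 0.2 are all assumptions made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_interval(sample, population_sd, confidence=0.95):
    """Return (lower, upper) bounds of a Z-interval for the population mean."""
    n = len(sample)
    estimate = mean(sample)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = z * population_sd / sqrt(n)
    return estimate - margin_of_error, estimate + margin_of_error


# Hypothetical measurements; the population is assumed normal with a known
# standard deviation of 0.2, which is what justifies a Z-interval at n = 8.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
lower, upper = z_interval(sample, population_sd=0.2)
print(f"95% Z-interval: ({lower:.3f}, {upper:.3f})")
```

A T-interval follows the same pattern, with the sample standard deviation in place of the population one and a critical value from the t distribution with n - 1 degrees of freedom. The standard library has no t quantile function, so that value would come from a table or a package such as SciPy.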
Gamma distribution
  • Two parameter family of continuous probability distributions.
  • Exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution.
  • Three different parametrizations are in common use:
    1. With a shape parameter k and a scale parameter θ.
    2. With a shape parameter α = k and an inverse scale (rate) parameter β = 1/θ.
    3. With a shape parameter k and a mean parameter μ = kθ = α/β.
    
  • The gamma distribution is the maximum entropy probability distribution for a random variable X for which both E[X] = kθ = α/β is fixed and greater than 0, and E[ln(X)] = ψ(k) + ln(θ) = ψ(α) − ln(β) is fixed (ψ is the digamma function).
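
Since the three parametrizations describe the same distribution, converting between them is simple arithmetic. The sketch below shows one way to do it; the class `GammaParams` and its method names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GammaParams:
    """Gamma parameters stored as shape k and scale theta (hypothetical helper)."""

    shape: float  # k, also written alpha
    scale: float  # theta

    @property
    def rate(self) -> float:
        """Inverse scale: beta = 1 / theta."""
        return 1.0 / self.scale

    @property
    def mean(self) -> float:
        """Mean: mu = k * theta = alpha / beta."""
        return self.shape * self.scale

    @classmethod
    def from_shape_rate(cls, alpha: float, beta: float) -> "GammaParams":
        return cls(shape=alpha, scale=1.0 / beta)

    @classmethod
    def from_shape_mean(cls, k: float, mu: float) -> "GammaParams":
        return cls(shape=k, scale=mu / k)


g = GammaParams(shape=2.0, scale=3.0)
print(g.rate, g.mean)
print(GammaParams.from_shape_rate(alpha=2.0, beta=0.5))
print(GammaParams.from_shape_mean(k=2.0, mu=6.0))
```

Storing shape and scale and deriving the rate and mean on demand is just one design choice; any of the three forms could serve as the stored one.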
Hypothesis testing
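  • Two types of errors in hypothesis testing: Type I (rejecting a null hypothesis that is true) and Type II (failing to reject a null hypothesis that is false).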
  • Test and draw conclusions about the value of a parameter
  • Power analysis
  • Tests of proportion
  • P-value approach
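
To make the p-value approach concrete, here is a minimal sketch of a two-sided, one-sample test of a proportion that uses the normal approximation. The function name `proportion_z_test` and the counts (60 successes in 100 trials, tested against a null proportion of 0.5) are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0):
    """Return (z, p_value) for H0: p = p0 against the alternative p != p0."""
    p_hat = successes / n
    # Standard error of the sample proportion, computed under the null hypothesis.
    standard_error = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Two-sided p-value: probability of a result at least this extreme under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = proportion_z_test(successes=60, n=100, p0=0.5)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The null hypothesis is rejected when the p-value falls below the chosen significance level, for example 0.05. That level is also the probability of a Type I error the test is willing to accept.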
Normal distribution
  • Symmetrical about its mean.
  • Bell-shaped with a single peak at the center of the distribution.
  • The arithmetic mean is at the peak and at the center, with half the area above the mean and half below it.
  • It is asymptotic: the curve gets ever closer to the X-axis but never touches it.
  • Mean, median and mode are equal.
  • The curve theoretically extends to infinity in both directions.
  • Standard Normal distribution has a mean of 0 and a standard deviation of 1
  • Z-score or Z-value is the signed distance between a selected value x and the population mean 𝞵, divided by the population standard deviation 𝞼.
  • z = (x-𝞵) / 𝞼
    
  • About 68.27 % of the area under the normal curve is within one standard deviation of the mean, 𝞵 ± 𝞼.
  • About 95.45 % of the area under the normal curve is within two standard deviations of the mean, 𝞵 ± 2𝞼.
  • About 99.73 % of the area under the normal curve is within three standard deviations of the mean, 𝞵 ± 3𝞼.
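
The z-score formula and the areas within one, two and three standard deviations can be checked with `statistics.NormalDist` from the Python standard library. This is a minimal sketch; the helper names `z_score` and `area_within` and the example numbers are made up for illustration.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Distance of x from the mean mu, measured in standard deviations sigma."""
    return (x - mu) / sigma


standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Hypothetical example: a value of 130 in a population with mean 100 and sd 15.
print(z_score(130, mu=100, sigma=15))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k):.2%}")
```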
Probability distribution
  • A listing of all possible outcomes of an experiment and the corresponding probabilities.
  • The sum of the probabilities of the various outcomes is 1.
  • The probability of a particular outcome is between 0 and 1.
  • The standard error of a probability estimated from a sample (a sample proportion) is inversely proportional to the square root of the sample size.
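
As a small worked example of a probability distribution, the sketch below lists every outcome of a binomial random variable with its probability and confirms that the probabilities sum to 1. The function name `binomial_pmf` and the choice of 4 fair coin flips are assumptions made for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical experiment: count the heads in 4 flips of a fair coin.
n, p = 4, 0.5
distribution = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

for heads, probability in distribution.items():
    print(f"P(heads = {heads}) = {probability}")

total = sum(distribution.values())
print("sum of probabilities:", total)
```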
Other basic terms
  • ANOVA - analysis of variance
  • degrees of freedom
  • mean E[x]
  • median
  • mode - the most frequently occurring value in the data set
  • normalize
  • outlier
  • p-value
  • parameter - any summary number that describes the population, e.g. an average or a percentage
  • population - any large collection of objects of interest
  • r-squared - the proportion of the variation in outcomes that the regression function explains
  • random variable - numerical value determined by the outcome of an experiment
  • range
  • random sample
  • sample - a subset chosen from the entire population, ideally representative of it
  • standard deviation - how far the data typically lie from the mean; the square root of the variance
  • standard error
  • statistic - a summary number that describes the sample, e.g. an average or a percentage
  • variance of a probability distribution - sigma squared (𝞼²), written Var[x]
  • z-score
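
Several of the terms above can be computed directly with Python's built-in `statistics` module. The sketch below uses a small made-up data set; the variable names are only for illustration.

```python
from math import sqrt
from statistics import mean, median, mode, pstdev, pvariance, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set

center = mean(data)                  # arithmetic mean
middle = median(data)                # middle value of the sorted data
most_common = mode(data)             # most frequently occurring value
spread = max(data) - min(data)       # range
population_variance = pvariance(data)  # treats the data as the whole population
population_sd = pstdev(data)           # square root of the population variance
sample_sd = stdev(data)                # treats the data as a sample (divides by n - 1)
standard_error = sample_sd / sqrt(len(data))  # standard error of the sample mean
z_of_nine = (9 - center) / population_sd      # z-score of the value 9

print(center, middle, most_common, spread)
print(population_variance, population_sd, sample_sd, standard_error, z_of_nine)
```

Whether the population or the sample version of the standard deviation applies depends on whether the data are the entire population or a sample drawn from it.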