Monday, March 25, 2019

Statistics 101

What is Statistics
A branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation.

Probability is a mathematical language to discuss uncertainties and it plays a key role in statistics.
In layman's terms, statistics is a toolbox of methods for getting answers from data.
Terms
Binomial random variable
  • A random variable from an experiment whose trials each have only two possible outcomes; it counts the successes in a fixed number of independent trials.
Chi-squared distribution with k degrees of freedom
  • The distribution of a sum of the squares of k independent standard normal random variables.
  • It is a special case of the gamma distribution.
  • It is one of the most commonly used probability distributions in inferential statistics.
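The chi-squared definition in this section (a sum of the squares of k independent standard normal variables) can be checked by simulation. The sketch below uses only the standard library; the seed, the sample count, and the helper name chi_squared_sample are illustrative assumptions. It relies on the fact that a chi-squared distribution with k degrees of freedom has mean k.

```python
import random


def chi_squared_sample(k, rng):
    """Draw one value by summing the squares of k standard normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(42)   # fixed seed so the run is repeatable
k = 3                     # degrees of freedom (arbitrary choice)
samples = [chi_squared_sample(k, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)

# The sample mean should land close to k
print(f"k = {k}, sample mean = {sample_mean:.3f}")
```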
Chi-square analysis
  • A χ² test is a hypothesis test where the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true.
  • In simple terms, it often means Pearson's chi-squared test.
  • Used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
  • It is often constructed from a sum of squared errors or the sample variance.
  • It assumes independent, normally distributed data, an assumption that is often justified by the central limit theorem.
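A minimal sketch of Pearson's chi-squared statistic, comparing observed and expected frequencies. The die-roll counts are made-up illustration data, and the helper name pearson_chi_squared is hypothetical. The critical value 11.070 is the standard table value for 5 degrees of freedom at the 0.05 significance level.

```python
def pearson_chi_squared(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 120 rolls of a die, counts for faces 1..6
observed = [22, 17, 20, 26, 22, 13]
expected = [20] * 6                      # a fair die gives 120 / 6 per face

statistic = pearson_chi_squared(observed, expected)
critical_value = 11.070                  # chi-squared table, df = 5, alpha = 0.05
reject_null = statistic > critical_value

print(f"chi-squared = {statistic:.2f}, reject fair-die hypothesis: {reject_null}")
```

The degrees of freedom here are the number of categories minus one. When the statistic stays below the critical value, the data give no significant evidence against the expected frequencies.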
Confidence interval
  • Estimate parameters of a population using a sample.
  • Use the sample mean x to find a range of values that we can be confident contains the mean of the sampled population.
  • Lower bound = estimate - margin of error
  • Upper bound = estimate + margin of error
  • T-intervals - use when the population standard deviation is unknown and either the original population is normal or the sample size is >= 30. This formula uses the sample standard deviation instead of the population standard deviation.
  • Z-intervals - use when the population standard deviation is known and either the sample size is >= 30 or the original population is normal.
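The bounds above (estimate ± margin of error) can be sketched for a z-interval with the standard library. The numbers (sample mean 50, population standard deviation 6, sample size 36) are made up for illustration, and z_interval is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(sample_mean, population_sd, n, confidence=0.95):
    """Return (lower, upper) bounds: estimate -/+ margin of error."""
    # Critical value z* such that the central area equals `confidence`
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * population_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical numbers: n >= 30 and the population standard deviation is known
lower, upper = z_interval(sample_mean=50.0, population_sd=6.0, n=36)
print(f"95% z-interval: ({lower:.2f}, {upper:.2f})")
```

A t-interval follows the same pattern, with the sample standard deviation in place of the population value and a critical value from the t distribution with n - 1 degrees of freedom (for example from scipy.stats.t.ppf, if SciPy is available).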
Gamma distribution
  • A two-parameter family of continuous probability distributions.
  • Exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution.
  • Three different parametrizations in common use:
    1. With a shape parameter k and a scale parameter θ.
    2. With a shape parameter α = k and an inverse scale (rate) parameter β = 1/θ.
    3. With a shape parameter k and a mean parameter μ = kθ = α/β.
    
  • The gamma distribution is the maximum entropy probability distribution for a random variable X for which both of the following are fixed:
    1. E[X] = kθ = α/β, which must be greater than 0.
    2. E[ln(X)] = ψ(k) + ln(θ) = ψ(α) − ln(β), where ψ is the digamma function.
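To make the three parametrizations concrete, here is a small sketch that converts the other two forms to shape and scale and evaluates the density. The function names are hypothetical. The density formula is the standard gamma density for x > 0: x^(k−1) e^(−x/θ) / (Γ(k) θ^k).

```python
import math


def gamma_pdf(x, k, theta):
    """Gamma density with shape k and scale theta, for x > 0."""
    return x ** (k - 1) * math.exp(-x / theta) / (math.gamma(k) * theta ** k)


def from_shape_rate(alpha, beta):
    """Convert (alpha, beta) to (k, theta): k = alpha, theta = 1 / beta."""
    return alpha, 1.0 / beta


def from_shape_mean(k, mu):
    """Convert (k, mu) to (k, theta): mu = k * theta, so theta = mu / k."""
    return k, mu / k


# The same distribution written three ways: k = 2, theta = 3
k1, theta1 = 2.0, 3.0
k2, theta2 = from_shape_rate(alpha=2.0, beta=1.0 / 3.0)
k3, theta3 = from_shape_mean(k=2.0, mu=6.0)
print(gamma_pdf(4.0, k1, theta1), gamma_pdf(4.0, k2, theta2), gamma_pdf(4.0, k3, theta3))

# With k = 1 the gamma density reduces to the exponential density (1/theta) e^(-x/theta)
exponential_case = gamma_pdf(2.0, 1.0, 0.5)
```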
Hypothesis testing
  • Two types of errors in hypothesis testing: Type I (rejecting a true null hypothesis) and Type II (failing to reject a false null hypothesis).
  • Test and draw conclusions about the value of a parameter
  • Power analysis
  • Tests of proportion
  • P-value approach
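The test of proportion and the p-value approach listed above can be combined in one short sketch. It assumes the normal approximation is acceptable for the sample size; the coin-flip counts are made up, and proportion_z_test is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0):
    """Two-sided one-sample z-test for a proportion against the null value p0."""
    p_hat = successes / n
    standard_error = sqrt(p0 * (1 - p0) / n)   # computed under the null hypothesis
    z = (p_hat - p0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 60 heads in 100 flips, null hypothesis p = 0.5
z, p_value = proportion_z_test(successes=60, n=100, p0=0.5)
alpha = 0.05                                   # accepted Type I error rate
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject at 5%: {p_value < alpha}")
```

The significance level alpha is the accepted probability of a Type I error; the null hypothesis is rejected when the p-value falls below it.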
Normal distribution
  • Symmetrical about its mean.
  • Bell-shaped with a single peak at the center of the distribution.
  • Arithmetic mean is at the peak and at the center, with half the area above the mean and half under the mean.
  • It is asymptotic and the curve gets closer to the X-axis but never really touches it.
  • Mean, median and mode are equal
  • Curve extends to infinity theoretically
  • Standard Normal distribution has a mean of 0 and a standard deviation of 1
  • The z-score (or z-value) is the distance between a selected value x and the population mean μ, divided by the population standard deviation σ.
  • z = (x − μ) / σ
    
  • About 68.27% of the area under the normal curve is within one standard deviation of the mean: μ ± σ
  • About 95.45% of the area under the normal curve is within two standard deviations of the mean: μ ± 2σ
  • About 99.73% of the area under the normal curve is within three standard deviations of the mean: μ ± 3σ
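The areas within one, two, and three standard deviations can be computed directly from the standard normal CDF. This sketch uses statistics.NormalDist from the standard library; area_within and z_score are illustrative helper names.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


def z_score(x, mu, sigma):
    """Distance of x from the mean, in units of standard deviations."""
    return (x - mu) / sigma


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k):.4%}")

# Hypothetical example: a value of 130 when the mean is 100 and the standard deviation is 15
print(z_score(130, mu=100, sigma=15))
```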
Probability distribution
  • Lists all possible outcomes of an experiment and the corresponding probabilities.
  • The sum of the probabilities of the various outcomes is 1.
  • The probability of a particular outcome is between 0 and 1.
  • The standard error of a probability estimated from a sample is inversely proportional to the square root of the sample size.
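As a concrete probability distribution, the sketch below tabulates a binomial random variable (number of heads in 10 fair coin flips, an arbitrary choice) and checks the properties listed above. binomial_pmf is a hypothetical helper name.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 10, 0.5
distribution = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

total = sum(distribution.values())          # should be 1
for outcome, probability in distribution.items():
    print(f"P(X = {outcome}) = {probability:.4f}")
print(f"sum of probabilities = {total}")
```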
Other basic terms
  • ANOVA - analysis of variance
  • degrees of freedom
  • mean E[x]
  • median
  • mode - most prevalent data points in the data set
  • normalize
  • outlier
  • p-value
  • parameter - any summary number that describes the population, e.g. an average or percentage
  • population - any large collection of objects of interest.
  • r-squared - the proportion of the variation in outcomes that the regression function explains
  • random variable - numerical value determined by the outcome of an experiment
  • range
  • random sample
  • sample - a representative group chosen from the entire population
  • standard deviation - a measure of how far the data typically lie from the mean
  • standard error
  • statistic - a summary number that describes the sample, e.g. an average or percentage
  • variance of a probability distribution - sigma squared, Var[x]
  • z-score
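Several of the basic terms above map directly onto the standard library's statistics module. The data set is a small made-up example chosen so the results are round numbers.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical data set

mean = statistics.mean(data)                # arithmetic mean
median = statistics.median(data)            # middle value of the sorted data
mode = statistics.mode(data)                # most prevalent data point
data_range = max(data) - min(data)          # largest minus smallest value
population_sd = statistics.pstdev(data)     # population standard deviation
variance = statistics.pvariance(data)       # population variance (sd squared)
z_of_max = (max(data) - mean) / population_sd   # z-score of the largest value

print(mean, median, mode, data_range, population_sd, variance, z_of_max)
```

If the data are a sample rather than the whole population, statistics.stdev and statistics.variance use the n − 1 denominator instead.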

    Thursday, March 14, 2019

    Python Projects

    GAPMINDER visualization
  • Gapminder.org data
  • ## www.gapminder.org/data
    ## www.gapminder.org/tools
    ## Factfulness (2018) Hans Rosling
    ## Search YouTube: Hans Rosling's 200 Countries, 200 Years, 4 minutes
    
    Example
    
    import re, mailbox, csv
    import numpy as np
    import pandas as pd
    import scipy.stats
    from sklearn import datasets
    import matplotlib
    import matplotlib.pyplot as plt
    from IPython import display
    from ipywidgets import interact, widgets
    # %matplotlib inline
    gapminder = pd.read_csv('gapminder.csv')
    
    
    Wine quality
    Email Analysis

    Sunday, March 10, 2019

    Python Scikit-Learn Library

    Meet sklearn
  • scikit-learn.org
  • scikit-learn tutorial
  • Scikit-learn user guide
  • import sklearn
    print(sklearn.__version__)
    # Older scikit-learn releases ran their test suite with nose.
    # Shell command (requires the nose package): nosetests sklearn --exe
    
    SkLearn Basics
  • Complements and extends SciPy
  • Classification
  • Regression
  • Clustering
  • Dimensionality reduction
  • Example
    from sklearn import datasets
    iris = datasets.load_iris()
    digits = datasets.load_digits()
    print(digits.data)
    from sklearn import svm
    clf = svm.SVC(gamma=0.001, C=100)
    clf.fit(digits.data[:-1], digits.target[:-1])
    clf.predict(digits.data[-1:])
    
    Example
    # Install a pip package in the current Jupyter kernel
    import sys
    !{sys.executable} -m pip install mglearn
    
    Example
    from sklearn.linear_model import LinearRegression
    # Training data
    X = [[7], [8], [10], [14], [18]]
    y = [[8], [9], [13], [17.5], [20]]
    # Create and fit the model
    model = LinearRegression()
    model.fit(X, y)
    # predict() expects a 2-D array of samples; y is 2-D, so the result is 2-D as well
    print('Predicted value: $%.2f' % model.predict([[12]])[0][0])
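For a single feature, the fit in the linear regression example above can be reproduced by hand with the ordinary least squares formulas, which also gives the r-squared value. This is a standard-library sketch of the calculation, not scikit-learn's internal implementation; fit_line and r_squared are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Same training data as the example above, flattened to plain lists
xs = [7, 8, 10, 14, 18]
ys = [8, 9, 13, 17.5, 20]

slope, intercept = fit_line(xs, ys)
prediction = intercept + slope * 12
print('Predicted value: $%.2f' % prediction)
print('r-squared: %.3f' % r_squared(xs, ys, slope, intercept))
```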
    

    Tuesday, March 5, 2019

    Python Decorators

    Python Decorators
  • Python Decorators
  • @accepts(int,int)
    @classmethod
    @decorator
    @property
    @returns(float)
    @staticmethod
    @funcattrs(grammar="'@' dotted_name [ '(' [arglist] ')' ]",
                   status="experimental", author="BDFL")
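A decorator is a callable that takes a function and returns a replacement for it. As a rough sketch of how something like the @accepts(int, int) entry in the list above could be written, here is a hypothetical minimal version that checks positional argument types. It is an illustration only, not a library API.

```python
import functools


def accepts(*expected_types):
    """Hypothetical decorator factory: check positional argument types."""
    def decorator(func):
        @functools.wraps(func)              # keep the wrapped function's name and docstring
        def wrapper(*args):
            if len(args) != len(expected_types):
                raise TypeError(
                    f"{func.__name__} expects {len(expected_types)} arguments")
            for position, (arg, expected) in enumerate(zip(args, expected_types)):
                if not isinstance(arg, expected):
                    raise TypeError(
                        f"argument {position} of {func.__name__} "
                        f"must be {expected.__name__}")
            return func(*args)
        return wrapper
    return decorator


@accepts(int, int)
def add(a, b):
    return a + b


print(add(2, 3))
```

Writing @accepts(int, int) above a definition is shorthand for add = accepts(int, int)(add).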
    
    Example: @property
    class TempC2F:
        def __init__(self, temp=0):
            # Assign through the property so the setter validates the value
            self.temperature = temp
    
        def to_fahrenheit(self):
            return (self.temperature * 1.8) + 32
    
        @property
        def temperature(self):
            return self._t
    
        @temperature.setter
        def temperature(self, v):
            # Absolute zero is -273.15 degrees Celsius
            if v < -273.15:
                raise ValueError("Temperature below -273.15 is not possible")
            self._t = v
    

    Friday, March 1, 2019

    Java Exercises #1

    Exercise 1-1
    1. Write a program that takes inputs from the command line and creates a todo list.
    2. Take names from the command line and create a list. Print out the sorted list.
    3. Take a list of numbers from input and search for a given number.
       Try different approaches and analyze their run time.
    4. Take two n-digit numbers and multiply them using an algorithm better than brute force.
    5. Implement the Karatsuba algorithm for integer multiplication.
    Exercise 1-2
    1. Build a binary tree
    2. Take "car" and return the number of possible ways to convert it to "let", changing one letter at a time.
    Exercise 1-3
    Take a list of students' names and their scores. Print the top 5 scores and the corresponding names.
    Exercise 1-4
    Take a list of pairs of numbers, where each pair indicates two nodes that are connected. Provide two functions:
    1. Output the connected clusters.
    2. Return true or false if two nodes are connected.

    Python Exercises #1

    Exercise 1-1
    1. Write a program that takes inputs from the command line and creates a todo list.
    2. Take names from the command line and create a list. Print out the sorted list.
    3. Take a list of numbers from input and search for a given number.
       Try different approaches and analyze their run time.
    4. Take two n-digit numbers and multiply them using an algorithm better than brute force.
    5. Implement the Karatsuba algorithm for integer multiplication.
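One possible starting point for the Karatsuba item, as a sketch rather than a reference solution. It assumes non-negative integers and splits on decimal digits. The idea is to replace the four half-size multiplications of the schoolbook method with three.

```python
def karatsuba(x, y):
    """Multiply two non-negative integers with three recursive products."""
    if x < 10 or y < 10:                     # single-digit base case
        return x * y

    half = max(len(str(x)), len(str(y))) // 2
    base = 10 ** half

    # Split each number: x = high_x * base + low_x
    high_x, low_x = divmod(x, base)
    high_y, low_y = divmod(y, base)

    z0 = karatsuba(low_x, low_y)
    z2 = karatsuba(high_x, high_y)
    # (low_x + high_x)(low_y + high_y) - z2 - z0 = low_x*high_y + high_x*low_y
    z1 = karatsuba(low_x + high_x, low_y + high_y) - z2 - z0

    return z2 * base * base + z1 * base + z0


print(karatsuba(1234, 5678))
```

Python's built-in integers already multiply large numbers efficiently, so this is useful mainly for analyzing the run time, as the exercise asks.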
    Exercise 1-2
    1. Build a binary tree
    2. Take "car" and return the number of possible ways to convert it to "let", changing one letter at a time.
    Exercise 1-3
    Take a list of students' names and their scores. Print the top 5 scores and the corresponding names.
    Exercise 1-4
    Take a list of pairs of numbers, where each pair indicates two nodes that are connected. Provide two functions:
    1. Output the connected clusters.
    2. Return true or false if two nodes are connected.
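One way to approach Exercise 1-4 is a disjoint-set (union-find) structure. This is a sketch of that approach, not the only valid answer; the class and method names are illustrative.

```python
class DisjointSet:
    """Union-find over hashable node labels, with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        """Return the representative (root) of the cluster containing node."""
        self.parent.setdefault(node, node)
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root
        while self.parent[node] != root:
            next_node = self.parent[node]
            self.parent[node] = root
            node = next_node
        return root

    def union(self, a, b):
        """Merge the clusters containing a and b."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b

    def connected(self, a, b):
        """True if a and b are known nodes in the same cluster."""
        if a not in self.parent or b not in self.parent:
            return False
        return self.find(a) == self.find(b)

    def clusters(self):
        """Return the connected clusters as a list of sets."""
        groups = {}
        for node in list(self.parent):
            groups.setdefault(self.find(node), set()).add(node)
        return list(groups.values())


pairs = [(1, 2), (2, 3), (4, 5)]            # hypothetical input
ds = DisjointSet()
for a, b in pairs:
    ds.union(a, b)

print(ds.clusters())
print(ds.connected(1, 3), ds.connected(1, 4))
```

A breadth-first or depth-first search over an adjacency list would also work; union-find is convenient when pairs arrive one at a time.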