Ann's Code Guidebook: March 2019

Monday, March 25, 2019

Statistics 101

What is Statistics

A branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation.

Probability is a mathematical language to discuss uncertainties and it plays a key role in statistics.

In layman's terms, statistics is a toolbox of methods for getting answers from data.

Terms

Binomial random variable

The number of successes in a fixed number of independent trials of an experiment that has only two possible outcomes.

Chi-squared distribution with k degrees of freedom

The distribution of a sum of the squares of k independent standard normal random variables.
It is a special case of the gamma distribution.
It is one of the most commonly used probability distributions in inferential statistics.
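
The description of the chi-squared distribution as a sum of squared standard normal variables can be checked by simulation. This is a minimal sketch using only the standard library; the helper name `chi_squared_sample`, the seed, and the choice k = 3 are assumptions made for illustration.

```python
import random
import statistics


def chi_squared_sample(k, rng):
    """Draw one value: the sum of squares of k independent N(0, 1) draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(42)  # fixed seed so the run is repeatable
k = 3                    # degrees of freedom (illustrative choice)
draws = [chi_squared_sample(k, rng) for _ in range(20000)]

# In theory a chi-squared distribution has mean k and variance 2k.
sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)
print(f"mean     ~ {sample_mean:.2f} (theory: {k})")
print(f"variance ~ {sample_var:.2f} (theory: {2 * k})")
```

The simulated mean and variance should land close to k and 2k, but not exactly on them, because the sample is finite.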

Chi-square analysis

A χ² test is a hypothesis test where the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true.
In simple terms, it often means Pearson's chi-squared test.
It is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
The test statistic is often constructed from a sum of squared errors or from the sample variance.
It assumes independent, normally distributed data, an assumption that is valid in many cases due to the central limit theorem.
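
As a rough illustration of Pearson's test, the sketch below computes the statistic for a die rolled 88 times and compares it with the 5% critical value for 5 degrees of freedom. The observed counts are made up, and the function name `pearson_chi_squared` is hypothetical; a statistics library would normally also report a p-value.

```python
def pearson_chi_squared(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up counts for faces 1..6 of a die rolled 88 times.
observed = [16, 18, 16, 14, 12, 12]
total = sum(observed)
expected = [total / 6] * 6  # a fair die gives equal expected counts

statistic = pearson_chi_squared(observed, expected)

# Critical value of the chi-squared distribution with 6 - 1 = 5 degrees
# of freedom at the 0.05 significance level (from a standard table).
CRITICAL_VALUE_DF5_ALPHA05 = 11.070
reject_null = statistic > CRITICAL_VALUE_DF5_ALPHA05

print(f"chi-squared statistic = {statistic:.3f}")
print("reject fairness" if reject_null else "no evidence against fairness")
```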

Confidence interval

Estimates a parameter of a population using a sample.
Use the sample mean x̄ to find a range of values that we can be confident contains the mean of the sampled population.
Lower bound = estimate - margin of error
Upper bound = estimate + margin of error
T-intervals - use when the population standard deviation is unknown and either the original population is normal or the sample size is >= 30. This formula uses the sample standard deviation instead of the population standard deviation.
Z-intervals - use when the population standard deviation is known and either the sample size is >= 30 or the original population is normal.
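
The lower and upper bounds above can be sketched for the Z-interval case with the standard library alone. The sample values, the known population standard deviation `sigma = 3`, and the function name `z_interval` are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_interval(sample, sigma, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    n = len(sample)
    estimate = fmean(sample)
    # Critical value z* such that the central `confidence` area lies
    # between -z* and +z* on the standard normal curve.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = z_star * sigma / sqrt(n)
    return estimate - margin_of_error, estimate + margin_of_error


# Made-up sample of 36 measurements (n >= 30), assumed sigma = 3.
sample = [48, 52, 50, 49, 51, 50] * 6
lower, upper = z_interval(sample, sigma=3)
print(f"95% interval for the mean: ({lower:.2f}, {upper:.2f})")
```

A T-interval follows the same pattern but replaces `sigma` with the sample standard deviation and the normal critical value with one from the t distribution.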

Gamma distribution

Two parameter family of continuous probability distributions.
Exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution.
Three different parametrizations are in common use:

1. With a shape parameter k and a scale parameter θ.
2. With a shape parameter α = k and an inverse scale (rate) parameter β = 1/θ.
3. With a shape parameter k and a mean parameter μ = kθ = α/β.

The gamma distribution is the maximum entropy probability distribution for a random variable X for which both of the following are fixed:
E[X] = kθ = α/β, which must be greater than 0.
E[ln(X)] = ψ(k) + ln(θ) = ψ(α) − ln(β), where ψ is the digamma function.
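
The three parametrizations describe the same distribution, which a short simulation can make concrete. In this sketch the values k = 2 and θ = 3 are arbitrary; note that the standard library's `random.gammavariate(alpha, beta)` takes its second argument as the scale, not the rate.

```python
import random
import statistics

# Parametrization 1: shape k and scale theta (illustrative values).
k, theta = 2.0, 3.0

# Parametrization 2: shape alpha and rate beta.
alpha, beta = k, 1.0 / theta

# Parametrization 3: shape k and mean mu.
mu = k * theta

rng = random.Random(0)  # fixed seed for a repeatable run
# gammavariate expects (shape, scale), so pass theta rather than beta.
draws = [rng.gammavariate(k, theta) for _ in range(50000)]
sample_mean = statistics.fmean(draws)

print(f"mu = k * theta     = {mu}")
print(f"mu = alpha / beta  = {alpha / beta}")
print(f"simulated mean    ~ {sample_mean:.2f}")
```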

Hypothesis testing

Tests and draws conclusions about the value of a parameter.
Two types of errors: Type I (rejecting a true null hypothesis) and Type II (failing to reject a false null hypothesis).
Power analysis
Tests of proportion
P-value approach
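
A test of proportion with the p-value approach can be sketched as follows. The scenario (60 heads in 100 coin flips, null hypothesis p = 0.5) is made up, and `one_proportion_z_test` is a hypothetical helper that relies on the normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: the true proportion equals p0."""
    p_hat = successes / n
    # Standard error of the sample proportion, computed under H0.
    standard_error = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Probability of a result at least this extreme in either tail.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = one_proportion_z_test(successes=60, n=100, p0=0.5)
alpha = 0.05  # significance level: the accepted Type I error rate
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```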

Normal distribution

Symmetrical about its mean.
Bell-shaped with a single peak at the center of the distribution.
The arithmetic mean is at the peak and at the center, with half the area above the mean and half below it.
It is asymptotic: the curve gets closer and closer to the X-axis but never touches it.
Mean, median and mode are equal.
The curve theoretically extends to infinity in both directions.
The standard normal distribution has a mean of 0 and a standard deviation of 1.
The Z-score or Z-value is the distance between a selected value x and the population mean μ, divided by the population standard deviation σ.

z = (x - μ) / σ

About 68.27% of the area under the normal curve is within one standard deviation of the mean: μ ± σ
About 95.45% of the area under the normal curve is within two standard deviations of the mean: μ ± 2σ
About 99.73% of the area under the normal curve is within three standard deviations of the mean: μ ± 3σ
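
The z-score formula and the areas within one, two and three standard deviations can be reproduced with `statistics.NormalDist` (Python 3.8 or later). The helper names and the example value (x = 130 with mean 100 and standard deviation 15) are assumptions for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def z_score(x, mu, sigma):
    """Distance of x from the mean, in units of standard deviations."""
    return (x - mu) / sigma


def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


print("z-score of 130 when mu=100, sigma=15:", z_score(130, 100, 15))
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k):.2%}")
```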

Probability distribution

Lists all possible outcomes of an experiment and the corresponding probabilities.
The sum of the probabilities of the various outcomes is 1.
The probability of a particular outcome is between 0 and 1.
The standard deviation of a sample mean (the standard error) is inversely proportional to the square root of the sample size.
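
A fair six-sided die makes a small worked example of a probability distribution. The sketch below uses exact fractions so the checks hold without rounding; the variable names are illustrative.

```python
from fractions import Fraction

# Each face of a fair die maps to its probability.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Every probability lies between 0 and 1, and together they sum to 1.
all_valid = all(0 <= p <= 1 for p in die.values())
total = sum(die.values())

# Mean and variance of the distribution.
mean = sum(face * p for face, p in die.items())
variance = sum((face - mean) ** 2 * p for face, p in die.items())

print("all probabilities valid:", all_valid)
print("sum of probabilities:", total)
print("mean:", mean, "variance:", variance)
```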

Other basic terms

ANOVA - analysis of variance

degrees of freedom

mean E[x]

median

mode - the most frequently occurring value in the data set

normalize

outlier

p-value

parameter - any summary number that describes the population, e.g. an average or a percentage

population - any large collection of objects of interest.

r-squared - how much the regression function explains the variation in outcomes

random variable - numerical value determined by the outcome of an experiment

range

random sample

sample - a representative group chosen from the entire population

standard deviation - a measure of how far the data typically lie from the mean

standard error

statistic - a summary number that describes the sample, e.g. an average or a percentage

Variance of a probability distribution - sigma squared Var[x]

z-score
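
Several of the terms above can be computed directly with Python's `statistics` module. The data set here is made up and is treated as a whole population, which is why `pstdev` is used rather than `stdev`.

```python
import statistics

# Made-up data set, treated as an entire population.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
value_range = max(data) - min(data)
std_dev = statistics.pstdev(data)       # population standard deviation
variance = statistics.pvariance(data)   # population variance
z_of_max = (max(data) - mean) / std_dev  # z-score of the largest value

print(f"mean={mean}, median={median}, mode={mode}, range={value_range}")
print(f"std dev={std_dev}, variance={variance}, z-score of 9={z_of_max}")
```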


Thursday, March 14, 2019

Python Projects

GAPMINDER visualization

Gapminder.org data

## www.gapminder.org/data
## www.gapminder.org/tools
## Factfulness (2018) Hans Rosling
## Search YouTube: Hans Rosling's 200 Countries, 200 Years, 4 minutes

Example


import re, mailbox, csv
import numpy as np
import pandas as pd
import scipy.stats
from sklearn import datasets
import matplotlib
import matplotlib.pyplot as plt
from IPython import display
from ipywidgets import interact, widgets
# %matplotlib inline
gapminder = pd.read_csv('gapminder.csv')

Wine quality

Wine data

Email Analysis

Your Google Mail Takeout

Sunday, March 10, 2019

Python Scikit-Learn Library

Meet sklearn

scikit-learn.org

scikit-learn tutorial

Scikit-learn user guide

import sklearn
print(sklearn.__version__)
# To run the test suite, older scikit-learn releases used nose from a shell:
# nosetests sklearn --exe

SkLearn Basics

Complements and extends SciPy

Classification

Regression

Clustering

Dimensionality reduction

Example

from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
print(digits.data)
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100)
clf.fit(digits.data[:-1], digits.target[:-1])
clf.predict(digits.data[-1:])

Example

# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install mglearn

Example

from sklearn.linear_model import LinearRegression
# Training data
X = [[7], [8], [10], [14], [18]]
y = [[8], [9], [13], [17.5], [20]]
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# predict() expects a 2D array of samples and, because y is 2D, returns a 2D array
print('Predicted value: $%.2f' % model.predict([[12]])[0][0])

Tuesday, March 5, 2019

Python Decorators

@accepts(int,int)
@classmethod
@decorator
@property
@returns(float)
@staticmethod
@funcattrs(grammar="'@' dotted_name [ '(' [arglist] ')' ]",
               status="experimental", author="BDFL")
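
To show how a decorator such as `@accepts(int,int)` might work, here is a minimal sketch of a type-checking decorator factory. It is a hypothetical implementation written for illustration, not a built-in: it only checks positional arguments and ignores keyword arguments.

```python
import functools


def accepts(*types):
    """Hypothetical decorator factory: check positional argument types."""
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args):
            for value, expected in zip(args, types):
                if not isinstance(value, expected):
                    raise TypeError(
                        f"{func.__name__} expected {expected.__name__}, "
                        f"got {type(value).__name__}"
                    )
            return func(*args)
        return wrapper
    return decorator


@accepts(int, int)
def add(a, b):
    """Add two integers."""
    return a + b


print(add(2, 3))
```

Writing `@accepts(int, int)` above `def add` is equivalent to `add = accepts(int, int)(add)`.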

@property

class TempC2F:
    def __init__(self, temp = 0):
        self.temperature = temp  # use the setter so the range check also applies here

    def to_fahrenheit(self):
        return (self.temperature * 1.8) + 32

    @property
    def temperature(self):
        return self._t

    @temperature.setter
    def temperature(self, v):
        if v < -273.15:
            raise ValueError("Temperature below -273.15 °C is not possible")
        self._t = v

Friday, March 1, 2019

Java Exercises #1

Exercise 1-1

Write a program that takes input from the command line and creates a to-do list.
Take names from the command line and create a list. Print out the sorted list.
Take a list of numbers from input and search for a given number. Try different approaches and analyze their run time.
Take two n-digit numbers and multiply them using an algorithm better than brute force.
Implement the Karatsuba algorithm for integer multiplication.

Exercise 1-2

Build a binary tree
Take "car" and return how many possible ways there are to convert it to "let", changing one letter at a time.

Exercise 1-3

Take a list of students' names and their scores. Print the top 5 scores and the corresponding names.

Exercise 1-4

Take a list of pairs of numbers, where each pair indicates two nodes that are connected. Provide two functions:
1. Output the connected clusters.
2. Return true or false if two nodes are connected.

Python Exercises #1

Exercise 1-1

Write a program that takes input from the command line and creates a to-do list.
Take names from the command line and create a list. Print out the sorted list.
Take a list of numbers from input and search for a given number. Try different approaches and analyze their run time.
Take two n-digit numbers and multiply them using an algorithm better than brute force.
Implement the Karatsuba algorithm for integer multiplication.
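
One possible approach to the Karatsuba exercise is sketched below. It is a teaching version for non-negative integers that splits on decimal digits; the function name and the single-digit base case are choices made for this sketch, not the only way to write it.

```python
def karatsuba(x, y):
    """Multiply two non-negative integers with Karatsuba's algorithm."""
    # Base case: single-digit factors are multiplied directly.
    if x < 10 or y < 10:
        return x * y

    # Split both numbers around the middle digit of the longer one.
    half = max(len(str(x)), len(str(y))) // 2
    split = 10 ** half
    high_x, low_x = divmod(x, split)
    high_y, low_y = divmod(y, split)

    # Three recursive multiplications instead of the naive four.
    z0 = karatsuba(low_x, low_y)
    z2 = karatsuba(high_x, high_y)
    z1 = karatsuba(low_x + high_x, low_y + high_y) - z2 - z0

    return z2 * split ** 2 + z1 * split + z0


print(karatsuba(1234, 5678))
```

Python's built-in `*` already handles big integers efficiently, so a version like this is mainly useful for studying the recursion and its run time.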

Exercise 1-2

Build a binary tree
Take "car" and return how many possible ways there are to convert it to "let", changing one letter at a time.

Exercise 1-3

Take a list of students' names and their scores. Print the top 5 scores and the corresponding names.
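
A short sketch of one way to approach this exercise uses `heapq.nlargest`, which avoids sorting the whole list. The names and scores are made up for illustration.

```python
import heapq

# Made-up (name, score) pairs.
scores = [
    ("Ana", 91), ("Ben", 78), ("Cara", 85), ("Dev", 96),
    ("Eli", 67), ("Fay", 88), ("Gus", 73), ("Hana", 99),
]

# Pick the five entries with the highest score, best first.
top_five = heapq.nlargest(5, scores, key=lambda pair: pair[1])

for name, score in top_five:
    print(f"{score:3d}  {name}")
```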

Exercise 1-4

Take a list of pairs of numbers, where each pair indicates two nodes that are connected. Provide two functions:
1. Output the connected clusters.
2. Return true or false if two nodes are connected.
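
The two functions in this exercise map naturally onto a disjoint-set (union-find) structure. The sketch below is one possible approach with path compression; the class name, method names, and example pairs are all chosen for illustration.

```python
class DisjointSet:
    """Union-find over hashable nodes, with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        """Return the representative of the cluster containing node."""
        self.parent.setdefault(node, node)  # unseen nodes start alone
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[node] != root:
            self.parent[node], node = root, self.parent[node]
        return root

    def union(self, a, b):
        """Merge the clusters containing a and b."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def connected(self, a, b):
        """Return True if a and b are in the same cluster."""
        return self.find(a) == self.find(b)

    def clusters(self):
        """Return the connected clusters as a list of sets."""
        groups = {}
        for node in list(self.parent):
            groups.setdefault(self.find(node), set()).add(node)
        return list(groups.values())


pairs = [(1, 2), (2, 3), (4, 5)]  # made-up connections
ds = DisjointSet()
for a, b in pairs:
    ds.union(a, b)

print(ds.clusters())
print(ds.connected(1, 3), ds.connected(3, 5))
```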
