On sampling and testing

You have some data (a sample, which you have built with sweat and blood), you have a hypothesis, and the data is meant to help you support or reject it.

How to sample

In general, you'd have a sample of a population; if you had the full population, you would know everything about it. You have to test precisely because you only have partial information. But is your sample always good, that is, representative? Sampling randomly (that is, uniformly) isn't always the best idea.

Stratified sampling

Stratified sampling is a way to sample data from a population that isn't homogeneous: in that case, sampling "randomly" (all points extracted with the same probability) risks producing a sample that does not reflect the population's heterogeneity.
Stratification is the process of dividing the population into homogeneous subgroups (strata) before sampling, so that each element belongs to exactly one stratum; random sampling is then applied within each stratum.
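The two steps above (split into strata, then sample at random within each) can be sketched as follows. This is a minimal illustration using only the standard library; the function name `stratified_sample`, the `stratum_of` callback and the toy population are assumptions made for the example, not something the text prescribes.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, sizes, seed=0):
    """Draw sizes[s] items uniformly, without replacement, from each stratum s."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        # Each item is assigned to exactly one stratum.
        strata[stratum_of(item)].append(item)
    # Simple random sampling inside every stratum.
    return {s: rng.sample(members, sizes[s]) for s, members in strata.items()}

# Hypothetical population: 90 items labelled "a" and 10 labelled "b".
population = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(
    population,
    stratum_of=lambda item: item[0],
    sizes={"a": 9, "b": 1},
)
```

How many items to draw from each stratum (the `sizes` argument here) is exactly what an allocation strategy decides.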

Proportional allocation

In this strategy, you use the same sampling fraction for each stratum: if $n$ is the desired sample size, we use $n_s = \frac{n N_s}{N}$ as the sample size for stratum $s$, where $N$ is the total number of items and $N_s$ the number of items in the stratum. Each stratum's share of the sample then equals its share of the population.
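As a quick worked example of proportional allocation, here is a small sketch. The function name and the stratum labels and sizes are made up for illustration. In general the resulting sizes are not integers, so in practice they need rounding.

```python
def proportional_allocation(n, stratum_sizes):
    """Return n_s = n * N_s / N for every stratum (not rounded)."""
    N = sum(stratum_sizes.values())  # total population size
    return {s: n * N_s / N for s, N_s in stratum_sizes.items()}

# Hypothetical strata: 6000 + 3000 + 1000 = 10000 items, sample size 100.
allocation = proportional_allocation(
    100, {"urban": 6000, "suburban": 3000, "rural": 1000}
)
# allocation == {"urban": 60.0, "suburban": 30.0, "rural": 10.0}
```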

Optimal allocation

In this strategy, the standard deviation $\sigma_s$ of the distribution in each stratum gets taken into account, so that the sample size for stratum $s$ is $n_s = \frac{n N_s \sigma_s}{\sum_{k=1}^S N_k \sigma_k}$, where $S$ is the number of strata. What this means is that strata are weighted by their variability as well as their size: a more variable stratum gets a larger share of the sample.
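The effect of the weighting is easiest to see with strata of equal size. The sketch below uses made-up stratum names, sizes and standard deviations, and assumes the per-stratum standard deviations are already known or estimated.

```python
def optimal_allocation(n, stratum_sizes, stratum_stds):
    """Return n_s = n * N_s * sigma_s / sum_k(N_k * sigma_k) (not rounded)."""
    weights = {s: stratum_sizes[s] * stratum_stds[s] for s in stratum_sizes}
    total = sum(weights.values())
    return {s: n * w / total for s, w in weights.items()}

# Two hypothetical strata of equal size; "b" is three times as variable as "a".
sizes = {"a": 500, "b": 500}
stds = {"a": 1.0, "b": 3.0}
allocation = optimal_allocation(100, sizes, stds)
# allocation == {"a": 25.0, "b": 75.0}
```

Proportional allocation would give 50 items to each stratum here; taking variability into account moves the sample towards the stratum where a single observation tells you less.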


The null hypothesis

The null hypothesis is the one the statistical test checks against, that is, the one we are trying to reject; it is assumed to be true until the evidence suggests otherwise. Typically, it is indicated as $H_0$.

Hypothesis testing and types of error

Type I error and Type II error

A Type I error is a false positive: it occurs when $H_0$ is erroneously rejected.
A Type II error is a false negative: it occurs when $H_0$ is not rejected when it should be.
Here's a handy table:
Null hypothesis and types of errors

                      $H_0$ is true     $H_0$ is false
Reject $H_0$          Type I error      true positive
Don't reject $H_0$    true negative     Type II error
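
To make the Type I error concrete, here is a small simulation sketch. Everything specific in it is an assumption chosen for illustration and does not come from the text: a two-sided z-test with known standard deviation, a 5% significance level, samples of 30 points and 4000 repetitions. The data are always generated with $H_0$ true, so every rejection is a false positive.

```python
import random
from statistics import NormalDist, fmean

def rejects_h0(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: mean == mu0, with known standard deviation sigma."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return abs(z) > critical

rng = random.Random(0)
trials = 4000
# H0 is true in every trial (the data really have mean 0),
# so each rejection counted here is a Type I error.
false_positives = sum(
    rejects_h0([rng.gauss(0.0, 1.0) for _ in range(30)], mu0=0.0, sigma=1.0)
    for _ in range(trials)
)
type_i_rate = false_positives / trials
```

With this setup the observed rate of Type I errors should land close to the chosen significance level. Generating the data with a mean different from `mu0` and counting the non-rejections would estimate the Type II error rate instead.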