Introduction
This page presents methods for estimating the sample size required to establish a proportion. This is frequently needed to establish the frequency of events, such as death rates, complication rates, and success rates.
The first approach is to assume that the binomial distribution is approximately the same as the normal distribution, with the Standard Error depending on the proportion and the sample size. From this, the required sample size and the resulting error can be calculated according to the normal distribution model.
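In symbols, matching the SampleSizeNorm and ErrorNorm functions in the R Codes panel, where $p$ is the expected proportion, $E$ the tolerable error, and $z_{\alpha/2}$ the normal deviate for the chosen confidence (1.96 for 95% confidence):

$$n = \left\lceil \frac{z_{\alpha/2}^{2}\, p(1-p)}{E^{2}} \right\rceil \qquad\text{and}\qquad E = z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$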
It is also recognised that the binomial distribution is only similar to the normal distribution when the proportion is near 0.5, as the confidence interval cannot overlap 0 or 1. The distribution becomes increasingly skewed as the proportion moves away from 0.5, and this skewness is pronounced when the sample size is small. The second approach is therefore to accept that the binomial distribution is not the same as the normal distribution: it is skewed, narrower towards the extremes (0 or 1) and wider towards the middle (0.5). When setting the parameters for calculation, the errors on the two sides of the proportion must be recognised and specified separately, as lower (towards 0) or higher (towards 1).
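As a quick illustration of this skewness (not part of the page's own programs; a sketch using base R's qbinom with assumed example values):

```r
# Exact binomial 95% limits for an expected proportion of 0.1 in a sample of 30
n <- 30
p <- 0.1
lower <- qbinom(0.025, n, p) / n             # lower limit as a proportion
upper <- qbinom(0.975, n, p) / n             # upper limit as a proportion
c(errLow = p - lower, errHigh = upper - p)   # the two errors differ, reflecting the skew
```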
Terminology

When the proportion is not near the extremes (say 0.15 to 0.85), and the sample size is large (say both the number of positive and negative cases exceed 20), the results of calculations assuming the normal and the binomial distribution are similar, and the errors on both sides are symmetrical. When the proportion is nearer the extremes (<0.15 or >0.85) and the sample size is likely to be small, the more robust binomial distribution should be used.

References
Machin D, Campbell M, Fayers P, Pinol A (1997) Sample Size Tables for Clinical Studies. Second Ed. Blackwell Science. ISBN 0-86542-870-0. p. 135
Javascript Program

This sub-panel provides two programs: one to estimate the sample size required to find a proportion at the planning stage, and one to estimate the error of the confidence interval once the data are available.
Program 1: Sample Size for Proportion
Program 2: Error Estimation for Proportion
R Codes

This sub-panel presents the calculations of sample size and error for proportions, in R code.
The algorithms are essentially the same as those in the Javascript program, with minor alterations to comply with the format of R programming. The algorithms are based on Machin et al. (1997), referenced in the Introduction panel.
Section 1. Supportive functions used by all subsequent functions

Section 1.a. Global array of log(factorial) values for iterative binomial coefficients. arLogFact is an array of log(factorial) values. It is created just once, so that repeated testing of binomial coefficients does not require prolonged and repeated calculation of factorials.

```r
arLogFact <- vector()                  # global array of log(factorial) values

MakeLogFactArray <- function(n)        # create a vector of log(factorials) from 0 to n
{
  arLogFact <<- vector()               # clear the array
  x = 0
  arLogFact <<- append(arLogFact, x)   # log(0!) = 0
  for(i in 1:n)
  {
    x = x + log(i)                     # log(i!) = log((i-1)!) + log(i)
    arLogFact <<- append(arLogFact, x)
  }
}
```

Section 1.b. Functions for sample size and error using the normal distribution

```r
PtoZ <- function(p)      # z value from probability
{
  return (-qnorm(p))
}

ZtoP <- function(z)      # probability of z
{
  return (pnorm(-z))
}

SampleSizeNorm <- function(cf, prop, er)   # sample size for an infinite population, large sample
{
  # cf = percent confidence, prop = expected proportion, er = tolerable error
  za = PtoZ((1.0 - cf / 100.0) / 2.0)
  return (ceiling(prop * (1.0 - prop) * za * za / (er * er)))
}

ErrorNorm <- function(cf, n, prop)   # confidence interval for an infinite population, large sample
{
  # cf = percent confidence, n = sample size, prop = proportion found
  za = PtoZ((1.0 - cf / 100.0) / 2.0)
  return (za * sqrt(prop * (1 - prop) / n))
}
```

Section 2. Functions for calculating error and sample size using the binomial distribution

Section 2.a. Binomial coefficient and probability

```r
LogBinomCoeff <- function(n, k)   # logarithm of the binomial coefficient of n and k
{
  return (arLogFact[n + 1] - arLogFact[k + 1] - arLogFact[n - k + 1])
}

p_bin <- function(p, n, k)   # probability of observing k positive cases in a sample of n,
{                            # given that the reference probability is p
  return (exp(LogBinomCoeff(n, k) + log(p) * k + log(1 - p) * (n - k)))
}
```

Section 2.b. Error using the binomial distribution

```r
ErrorBinom <- function(c, n, prop, typ)   # confidence interval by the binomial distribution (formula 6.5)
{
  # typ: 0 = low side only, 1 = high side only, 2 = both low and high
  alpha = (1.0 - c / 100.0) / 2.0         # alpha in formula 6.5
  nPos = prop * n
  i = 0
  p = p_bin(prop, n, i)
  while(p < alpha && i < nPos)            # accumulate the lower tail (formula 6.5)
  {
    i = i + 1
    p = p + p_bin(prop, n, i)
  }
  errLow = prop - i / n                   # error on the low (towards 0) side
  if(typ == 0) return (errLow)
  alpha = 1 - alpha
  while(p < alpha)                        # accumulate up to the upper tail (formula 6.5)
  {
    i = i + 1
    p = p + p_bin(prop, n, i)
  }
  errHigh = i / n - prop                  # error on the high (towards 1) side
  if(typ == 1) return (errHigh)
  return (c(errLow, errHigh))
}
```

Section 2.c. Sample size using the binomial distribution

```r
SampleSizeBinomLow <- function(c, prop, ci)   # sample size, error on the low side, binomial
{
  ssiz = SampleSizeNorm(c, prop, ci)   # initial estimate using the normal distribution
  ussiz = ssiz * 2                     # upper bound for the binary search
  lssiz = round(ssiz / 2)              # lower bound for the binary search
  er = ErrorBinom(c, ssiz, prop, 0)
  while(abs(ussiz - lssiz) > 3)        # binary search for the required sample size
  {
    if(er < ci) {ussiz = ssiz} else {lssiz = ssiz}
    ssiz = round((ussiz + lssiz) / 2)
    er = ErrorBinom(c, ssiz, prop, 0)
  }
  return (ceiling((ussiz + lssiz) / 2))
}

SampleSizeBinomHigh <- function(c, prop, ci)   # sample size, error on the high side, binomial
{
  ssiz = SampleSizeNorm(c, prop, ci)   # initial estimate using the normal distribution
  ussiz = ssiz * 2
  lssiz = round(ssiz / 2)
  er = ErrorBinom(c, ssiz, prop, 1)
  while(abs(ussiz - lssiz) > 3)        # binary search for the required sample size
  {
    if(er > ci) {lssiz = ssiz} else {ussiz = ssiz}
    ssiz = round((ussiz + lssiz) / 2)
    er = ErrorBinom(c, ssiz, prop, 1)
  }
  return (ceiling((ussiz + lssiz) / 2))
}
```
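These functions can also be called directly. A minimal usage sketch follows: the inputs are those of the second row of the results in Section 3.1, and the table size of 2000 is an arbitrary value large enough for the searches involved; the expected outputs are taken from that results row.

```r
MakeLogFactArray(2000)               # log(factorial) table large enough for the searches below
SampleSizeNorm(95, 0.2, 0.05)        # normal approximation: 246
SampleSizeBinomLow(95, 0.2, 0.05)    # binomial, error on the low side: 237
SampleSizeBinomHigh(95, 0.2, 0.05)   # binomial, error on the high side: 255
```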
Section 3. Main programs with data I/O

Section 3.1. Sample Size

```r
maxSSiz = 1000   # default maximum sample size for the log(factorial) table
txt = ("
  Cf  Prop  Err
  90  0.4   0.1
  95  0.2   0.05
  99  0.1   0.01
")
df <- read.table(textConnection(txt), header=TRUE)
#df                  # optional display of the input data frame

# extract columns as vectors
arCf   <- df$Cf      # array of % confidence
arProp <- df$Prop    # array of expected proportions
arErr  <- df$Err     # array of tolerable errors

# create result vectors
arSSNorm    <- vector()   # sample size, normal distribution
arSSBinLow  <- vector()   # sample size, binomial, error on the lower (towards 0) side
arSSBinHigh <- vector()   # sample size, binomial, error on the higher (towards 1) side

for(i in 1:nrow(df))      # first run: sample size by the normal distribution for each row of data
{
  cf   = arCf[i]          # % confidence
  prop = arProp[i]        # expected proportion
  er   = arErr[i]         # tolerable error
  sNorm = SampleSizeNorm(cf, prop, er)   # sample size, normal distribution
  arSSNorm <- append(arSSNorm, sNorm)    # add to array
  if(sNorm > maxSSiz) maxSSiz = sNorm    # enlarge the table size if necessary
}
maxSSiz = maxSSiz * 2
MakeLogFactArray(maxSSiz)   # create the vector of log(factorials)

for(i in 1:nrow(df))        # second run: sample size by the binomial distribution for each row of data
{
  cf   = arCf[i]            # % confidence
  prop = arProp[i]          # expected proportion
  er   = arErr[i]           # tolerable error
  if((prop - er) < 0)       # error overlaps the lower limit 0
  {
    arSSBinLow <- append(arSSBinLow, "*")
  }
  else
  {
    arSSBinLow <- append(arSSBinLow, SampleSizeBinomLow(cf, prop, er))    # append sample size to result array
  }
  if((prop + er) > 1)       # error overlaps the upper limit 1
  {
    arSSBinHigh <- append(arSSBinHigh, "*")
  }
  else
  {
    arSSBinHigh <- append(arSSBinHigh, SampleSizeBinomHigh(cf, prop, er)) # append sample size to result array
  }
}

# incorporate the result arrays into the original data frame
df$SSNorm    <- arSSNorm
df$SSBinLow  <- arSSBinLow
df$SSBinHigh <- arSSBinHigh

# display data and results
df
```

The results are as follows

```
> df
  Cf Prop  Err SSNorm SSBinLow SSBinHigh
1 90  0.4 0.10     65       61        67
2 95  0.2 0.05    246      237       255
3 99  0.1 0.01   5972     5785      6064
```

Interpreting the results

Each row shows the sample size required for the stated confidence, expected proportion, and tolerable error. SSNorm assumes the normal distribution, with symmetrical error; SSBinLow and SSBinHigh use the binomial distribution, for a tolerable error on the lower (towards 0) or higher (towards 1) side respectively. A "*" indicates that the proportion and error overlap 0 or 1, so no sample size can be calculated on that side.
Section 3.2. Error

```r
maxSSiz = 1000   # default maximum sample size for the log(factorial) table
txt = ("
  Cf  SSiz  Prop
  90  65    0.4
  90  61    0.4
  95  67    0.2
  95  237   0.2
  99  5972  0.1
")
df <- read.table(textConnection(txt), header=TRUE)
#df                  # optional display of the data frame

# extract columns as vectors
arCf   <- df$Cf      # array of % confidence
arSSiz <- df$SSiz    # array of sample sizes of the data
arProp <- df$Prop    # array of proportions found

# create result vectors
arErNorm    <- vector()   # error, normal distribution
arErBinLow  <- vector()   # error, binomial distribution, lower (towards 0) side
arErBinHigh <- vector()   # error, binomial distribution, upper (towards 1) side

for(i in 1:nrow(df))   # first run: error by the normal distribution, and reset the maximum sample size
{
  cf   = arCf[i]       # % confidence
  ssiz = arSSiz[i]     # sample size
  prop = arProp[i]     # proportion
  arErNorm <- append(arErNorm, ErrorNorm(cf, ssiz, prop))   # append error, normal distribution
  if(ssiz > maxSSiz) maxSSiz = ssiz                         # adjust the maximum sample size
}
maxSSiz = maxSSiz * 2
MakeLogFactArray(maxSSiz)   # set up the array of log(factorial) values

for(i in 1:nrow(df))        # second run: error by the binomial distribution
{
  cf   = arCf[i]            # % confidence
  ssiz = arSSiz[i]          # sample size
  prop = arProp[i]          # proportion
  arBinErr <- ErrorBinom(cf, ssiz, prop, 2)
  arErBinLow  <- append(arErBinLow,  arBinErr[1])   # append error on the lower side
  arErBinHigh <- append(arErBinHigh, arBinErr[2])   # append error on the upper side
}

# incorporate the results into the original data frame
df$ErNorm    <- arErNorm
df$ErBinLow  <- arErBinLow
df$ErBinHigh <- arErBinHigh

# display data and results
df
```

The results are as follows

```
> df
  Cf SSiz Prop      ErNorm    ErBinLow  ErBinHigh
1 90   65  0.4 0.099948481 0.092307692 0.10769231
2 90   61  0.4 0.103173452 0.104918033 0.10819672
3 95   67  0.2 0.095779084 0.095522388 0.09850746
4 95  237  0.2 0.050925337 0.048101266 0.05316456
5 99 5972  0.1 0.009999503 0.009912927 0.01018084
```

Interpreting the results

Each row shows, for the stated confidence, sample size, and proportion found, the symmetrical error assuming the normal distribution (ErNorm), and the asymmetrical errors on the lower (ErBinLow) and upper (ErBinHigh) sides using the binomial distribution.
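The error functions can also be called directly. A minimal sketch follows: the inputs are those of the first row of the results above, and the table size of 130 is simply an arbitrary value large enough to cover n = 65; the expected outputs are taken from that results row.

```r
MakeLogFactArray(130)        # log(factorial) table covering n = 65
ErrorNorm(90, 65, 0.4)       # symmetrical error, normal distribution: about 0.0999
ErrorBinom(90, 65, 0.4, 2)   # lower and upper errors, binomial: about 0.0923 and 0.1077
```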
Tables

This sub-panel provides 4 tables of sample sizes for estimating proportions, with 80%, 90%, 95%, and 99% confidence intervals.
Table 1: Sample size for 80% confidence
Table 2: Sample size for 90% confidence
Table 3: Sample size for 95% confidence
Table 4: Sample size for 99% confidence
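The tables themselves are generated by the page. As an illustration only, a similar table can be produced with the SampleSizeNorm function from the R Codes panel; the grids of proportions and errors below are assumed example values, not the page's own:

```r
props <- seq(0.1, 0.5, by = 0.1)     # example grid of expected proportions (assumed values)
errs  <- c(0.01, 0.02, 0.05, 0.10)   # example grid of tolerable errors (assumed values)
tab <- outer(props, errs, Vectorize(function(p, e) SampleSizeNorm(95, p, e)))
dimnames(tab) <- list(Proportion = props, Error = errs)
tab   # sample sizes for 95% confidence by the normal approximation
```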