Title: | Estimation in Respondent Driven Samples |
---|---|
Description: | Maximum likelihood estimation in respondent driven samples. |
Authors: | Jonathan Rosenblatt |
Maintainer: | Jonathan Rosenblatt <[email protected]> |
License: | GPL-2 |
Version: | 0.95.4 |
Built: | 2025-02-28 04:16:44 UTC |
Source: | https://github.com/cran/chords |
Estimates population size and degree distribution in respondent driven samples (RDS).
Maximum likelihood estimation of population size using the methodology in the reference.
The fundamental idea is of modeling the sampling as an epidemic process.
See Estimate.b.k
for details.
Jonathan D. Rosenblatt [email protected]
[1] Berchenko, Y., Rosenblatt J.D., and S.D.W. Frost. "Modeling and Analyzing Respondent Driven Sampling as a Counting Process." arXiv:1304.3505
A respondent driven sample of heavy drug users in Curitiba.
data("brazil")
data("brazil")
A data frame with 303 observations on the following 8 variables.
MyUniID
Subject's ID.
NS1
Subject's self reported degree.
refCoupNum
Reference coupon no.
coup1
Supplied coupon.
coup2
Supplied coupon.
coup3
Supplied coupon.
interviewDt
Time of interview. See details.
interviewDt2
Deprecated.
The format of the data is essentially that of the RDS file format as specified in page 7 in the RDS Analysis tool manual: http://www.respondentdrivensampling.org/reports/RDSAT_7.1-Manual_2012-11-25.pdf.
The RDS format has been augmented with the time of interview (interviewDt
variable) required for the methodology in [1].
The interviewDt
variable encodes the time of interview.
For the purpose of calling Estimate.b.k
the scale and origin are imaterial. We thus use an arbitrary efficient encoding which might not adhere to the original scale.
For full details see the Source section.
[1] Salganik, M.J., Fazito, D., Bertoni, N., Abdo, A.H., Mello, M.B., and Bastos, F.I. (2011). "Assessing Network Scale-up Estimates for Groups Most at Risk of HIV/AIDS: Evidence From a Multiple-Method Study of Heavy Drug Users in Curitiba, Brazil." American Journal of Epidemiology, 174(10): 1190-1196.
And http://opr.princeton.edu/archive/nsum/
[1] Berchenko, Y., Rosenblatt J.D., and S.D.W. Frost. "Modeling and Analyzing Respondent Driven Sampling as a Counting Process." arXiv:1304.3505
Estimate population size from respondent driven samples (RDS) using maximum likelihood, and several variation. The underlying idea is that the sample spreads like an epidemic in the target population as described in the reference.
Estimate.b.k(rds.object, type = "mle", jack.control = NULL)
Estimate.b.k(rds.object, type = "mle", jack.control = NULL)
rds.object |
A object of class |
type |
A character vector with the type of estimation. Possible values:
|
jack.control |
A object of class |
As of version 0.95, this function is the main workhorse of the chords
package.
Given an rds-class
object, it will return population size estimates for each degree.
Note that for the rescaling
and parametric
estimators, the input rds-object
is expected to contain some initial estimate in the estimates
slot.
See the reference for a description of the likelihood problem solved. Optimization is performed by noting that likelihood is coordinate-wise convex, thus amounts to a series of line-searches.
An rds-class
object with an updated estimates
slot.
The estiamtes
slot is list
with the following components:
call |
The function call. |
Nk.estimates |
The estimated degree frequencies. |
log.bk.estimates |
The estimated sampling rates for each degree. In log scale. |
convergence |
0 if estimation of |
[1] Berchenko, Y., Rosenblatt D.J., and S.D.W. Frost. "Modeling and Analyzing Respondent Driven Sampling as a Counting Process." arXiv:1304.3505,
initializeRdsObject
,
makeRdsSample
,
getTheta
.
# Preliminaries data(brazil) rds.object2<- initializeRdsObject(brazil) see <- function(x) plot(x$estimates$Nk.estimates, type='h') # Maximum likelihood rds.object <- Estimate.b.k(rds.object = rds.object2 ) see(rds.object) # View estimates: plot(rds.object$estimates$Nk.estimates, type='h') # Population size estimate: sum(rds.object$estimates$Nk.estimates) plot(rds.object$estimates$log.bk.estimates, type='h') ## Recover theta assuming b.k=b_0*k^theta getTheta(rds.object) # How many degrees were imputed?: table(rds.object$estimates$convergence) # Observed degree estimation: rds.object.4 <- Estimate.b.k(rds.object = rds.object, type='observed') see(rds.object.4) # Naive rescaling rds.object.5 <- Estimate.b.k(rds.object = rds.object, type='rescaling') see(rds.object.5) # Parametric rates rds.object.6 <- Estimate.b.k(rds.object = rds.object, type='parametric') see(rds.object.6) jack.control <- makeJackControl(3, 1e1) rds.object.7 <- Estimate.b.k(rds.object = rds.object, type='leave-d-out', jack.control = jack.control) see(rds.object.7) rds.object.8 <- Estimate.b.k(rds.object = rds.object, type='integrated', jack.control = jack.control) see(rds.object.8) rds.object.9 <- Estimate.b.k(rds.object = rds.object, type='jeffreys') see(rds.object.9) ## Not run: ## Simulated data example: dk <- c(2, 1e1) # unique degree classes true.dks <- rep(0,max(dk)); true.dks[dk] <- dk true.Nks <- rep(0,max(dk)); true.Nks[dk] <- 1e3 beta <- 1 #5e-6 theta <- 0.1 true.log.bks <- rep(-Inf, max(dk)) true.log.bks[dk] <- theta*log(beta*dk) sample.length <- 4e2 nsims <- 1e2 simlist <- list() for(i in 1:nsims){ simlist[[i]] <- makeRdsSample( N.k =true.Nks , b.k = exp(true.log.bks), sample.length = sample.length) } # Estimate betas and theta with chords: llvec <- rep(NA,nsims) bklist <- list() for(i in 1:nsims){ # i <- 2 simlist[[i]] <- Estimate.b.k(rds.object = simlist[[i]]) # llvec[i] <- simlist[[i]]$estimates$likelihood bklist[[i]] <- simlist[[i]]$estimates$log.bk.estimates } b1vec <- bklist b2vec <- bklist hist(b1vec) abline(v=true.log.bks[2]) hist(b2vec) abline(v=true.log.bks[10]) beta0vec <- rep(-Inf,nsims) thetavec <- rep(-Inf,nsims) nvec <- rep(-Inf,nsims) converged <- rep(9999,nsims) for(i in 1:nsims){ # i <- 2 nvec[i] <- sum(simlist[[i]]$estimates$Nk.estimates) converged[i] <- sum(simlist[[i]]$estimates$convergence, na.rm=TRUE) # tfit <- getTheta(simlist[[i]]) # beta0vec[i] <- tfit$log.beta_0 # thetavec[i] <- tfit$theta } summary(beta0vec) summary(nvec) # summary(thetavec) # hist(thetavec) # abline(v=theta) hist(nvec) abline(v=sum(true.Nks), col='red') abline(v=median(nvec, na.rm = TRUE), lty=2) table(converged) # Try various re-estimatinons: rds.object2 <- simlist[[which(is.infinite(nvec))[1]]] rds.object <- Estimate.b.k(rds.object = rds.object2 ) see(rds.object) rds.object$estimates$Nk.estimates rds.object.5 <- Estimate.b.k(rds.object = rds.object, type='rescaling') see(rds.object.5) # will not work. less than 2 converging estimates. rds.object.5$estimates$Nk.estimates rds.object.6 <- Estimate.b.k(rds.object = rds.object, type='parametric') see(rds.object.6) # will not work. less than 2 converging estimates. jack.control <- makeJackControl(3, 1e2) rds.object.7 <- Estimate.b.k(rds.object = rds.object, type='leave-d-out', jack.control = jack.control) see(rds.object.7) rds.object.7$estimates$Nk.estimates rds.object.8 <- Estimate.b.k(rds.object = rds.object, type='integrated') see(rds.object.8) rds.object.8$estimates$Nk.estimates rds.object.9 <- Estimate.b.k(rds.object = rds.object, type='jeffreys') see(rds.object.9) rds.object.9$estimates$Nk.estimates ## End(Not run)
# Preliminaries data(brazil) rds.object2<- initializeRdsObject(brazil) see <- function(x) plot(x$estimates$Nk.estimates, type='h') # Maximum likelihood rds.object <- Estimate.b.k(rds.object = rds.object2 ) see(rds.object) # View estimates: plot(rds.object$estimates$Nk.estimates, type='h') # Population size estimate: sum(rds.object$estimates$Nk.estimates) plot(rds.object$estimates$log.bk.estimates, type='h') ## Recover theta assuming b.k=b_0*k^theta getTheta(rds.object) # How many degrees were imputed?: table(rds.object$estimates$convergence) # Observed degree estimation: rds.object.4 <- Estimate.b.k(rds.object = rds.object, type='observed') see(rds.object.4) # Naive rescaling rds.object.5 <- Estimate.b.k(rds.object = rds.object, type='rescaling') see(rds.object.5) # Parametric rates rds.object.6 <- Estimate.b.k(rds.object = rds.object, type='parametric') see(rds.object.6) jack.control <- makeJackControl(3, 1e1) rds.object.7 <- Estimate.b.k(rds.object = rds.object, type='leave-d-out', jack.control = jack.control) see(rds.object.7) rds.object.8 <- Estimate.b.k(rds.object = rds.object, type='integrated', jack.control = jack.control) see(rds.object.8) rds.object.9 <- Estimate.b.k(rds.object = rds.object, type='jeffreys') see(rds.object.9) ## Not run: ## Simulated data example: dk <- c(2, 1e1) # unique degree classes true.dks <- rep(0,max(dk)); true.dks[dk] <- dk true.Nks <- rep(0,max(dk)); true.Nks[dk] <- 1e3 beta <- 1 #5e-6 theta <- 0.1 true.log.bks <- rep(-Inf, max(dk)) true.log.bks[dk] <- theta*log(beta*dk) sample.length <- 4e2 nsims <- 1e2 simlist <- list() for(i in 1:nsims){ simlist[[i]] <- makeRdsSample( N.k =true.Nks , b.k = exp(true.log.bks), sample.length = sample.length) } # Estimate betas and theta with chords: llvec <- rep(NA,nsims) bklist <- list() for(i in 1:nsims){ # i <- 2 simlist[[i]] <- Estimate.b.k(rds.object = simlist[[i]]) # llvec[i] <- simlist[[i]]$estimates$likelihood bklist[[i]] <- simlist[[i]]$estimates$log.bk.estimates } b1vec <- bklist b2vec <- bklist hist(b1vec) abline(v=true.log.bks[2]) hist(b2vec) abline(v=true.log.bks[10]) beta0vec <- rep(-Inf,nsims) thetavec <- rep(-Inf,nsims) nvec <- rep(-Inf,nsims) converged <- rep(9999,nsims) for(i in 1:nsims){ # i <- 2 nvec[i] <- sum(simlist[[i]]$estimates$Nk.estimates) converged[i] <- sum(simlist[[i]]$estimates$convergence, na.rm=TRUE) # tfit <- getTheta(simlist[[i]]) # beta0vec[i] <- tfit$log.beta_0 # thetavec[i] <- tfit$theta } summary(beta0vec) summary(nvec) # summary(thetavec) # hist(thetavec) # abline(v=theta) hist(nvec) abline(v=sum(true.Nks), col='red') abline(v=median(nvec, na.rm = TRUE), lty=2) table(converged) # Try various re-estimatinons: rds.object2 <- simlist[[which(is.infinite(nvec))[1]]] rds.object <- Estimate.b.k(rds.object = rds.object2 ) see(rds.object) rds.object$estimates$Nk.estimates rds.object.5 <- Estimate.b.k(rds.object = rds.object, type='rescaling') see(rds.object.5) # will not work. less than 2 converging estimates. rds.object.5$estimates$Nk.estimates rds.object.6 <- Estimate.b.k(rds.object = rds.object, type='parametric') see(rds.object.6) # will not work. less than 2 converging estimates. jack.control <- makeJackControl(3, 1e2) rds.object.7 <- Estimate.b.k(rds.object = rds.object, type='leave-d-out', jack.control = jack.control) see(rds.object.7) rds.object.7$estimates$Nk.estimates rds.object.8 <- Estimate.b.k(rds.object = rds.object, type='integrated') see(rds.object.8) rds.object.8$estimates$Nk.estimates rds.object.9 <- Estimate.b.k(rds.object = rds.object, type='jeffreys') see(rds.object.9) rds.object.9$estimates$Nk.estimates ## End(Not run)
Estimates the effect of the degree on the rate of sampling. Also known as the "coefficient of discoverability" in the oil-discovery literature [2].
Formally, we estimate and
assuming that
.
getTheta(rds.object, bin=1, robust=TRUE)
getTheta(rds.object, bin=1, robust=TRUE)
rds.object |
A |
bin |
Bin degree counts. See Note. |
robust |
Should |
A list including the following components:
log.beta_0 |
The log of |
theta |
|
If degree counts have been binned by initializeRdsObject
(for variance reduction), the same value has to be supplied to getTheta
for correct estimation.
[1] Berchenko, Yakir, Jonathan Rosenblatt, and Simon D. W. Frost. "Modeling and Analyzing Respondent Driven Sampling as a Counting Process."" arXiv:1304.3505. HTTP://arXiv.org/abs/1304.3505. [2] Bloomfield, P., K.S. Deffeyes, B. Silverman, G.S. Watson, Y. Benjamini, and R.A. Stine. Volume and Area of Oil Fields and Their Impact on Order of Discovery, 1980. http://www.osti.gov/scitech/servlets/purl/6037591.
Estimate.b.k
,
initializeRdsObject
,
makeRdsSample
rds-object
from a data.frame.
Given a data frame with the appropriate variables, initializes a rds-object
with the components required by the Estimate.b.k
function for estimation.
initializeRdsObject(rds.sample, bin=1L, seeds=1L)
initializeRdsObject(rds.sample, bin=1L, seeds=1L)
rds.sample |
A data frame with required columns. See Details. |
bin |
The number of degrees fo bin together. See details. |
seeds |
The number of seed recruiters. See details. |
The essence of the function is in recovering the sampling snowball required by Estimate.b.k
.
The function allows for recruiters to enter and exit the sampling snowball.
The number of seed recruiters is typically not specified in an RDS file.
The seeds
argument is a workaround that allows to specify directly this number.
The rds.sample
object is assumed to be a data frame with the following column names:
MyUniIDan identifier of the sampling unit.[not required]
NS1The reported degree.[required]
refCoupNum The number of the referring coupon.
coup1The number of the 1st supplied coupon. NA if none. [required].
coupXThe number of the Xth supplied coupon. NA if none.[not required]
interviewDtThe time of the interview. In numeric representation from some origin. Ties are not allowed.
See brazil
for a sample data.frame.
If the sample is short, stabilization of degree estimates can be achieved by binning degrees together. This can be done with the bin
argument. Note however that the interpretation of the estimated degree counts is now different as the k'th degree is actually the k'th bin, which is only proportional to . An exception is the function
getTheta
which also accepts a bin
argument for proper estimation of .
A list with the following components.
rds.sampleThe input data frame. After ordering along time of arrival.
I.tThe sampling snowball. A list including the following items:
I.tAn integer of the count of the sampling individuals at the moments of recruitment.
degree.inAn integer with the degree of an added recruiter at the moments of recruitment.
degree.outAn integer with the degree of a removed recruiter at the moment of recruitment.
original.orderingThe order of the arrivals as was inputed in rds.sample$interviewDt
estimatesA placeholder for the future output of Estimate.b.k
[1] Berchenko, Y., Rosenblatt J.D., and S.D.W. Frost. "Modeling and Analyzing Respondent Driven Sampling as a Counting Process." arXiv:1304.3505
Estimate.b.k
,
makeRdsSample
,
brazil
A utility function for using the delete-d
option of the Estimate.b.k
function.
makeJackControl(d, B)
makeJackControl(d, B)
d |
The number of (random) arrivals in the sample to delete. |
B |
The number of deleted samples to generate. |
A list with named control parameters.
Musa, John D., and A. Iannino. "Estimating the Total Number of Software Failures Using an Exponential Model." SIGSOFT Softw. Eng. Notes 16, no. 3 (July 1991): 80-84. doi:10.1145/127099.127123.
Generates a sample from the sampling process assumed in the reference.
Well, actually, only the sufficient statistics required by Estimate.b.k
are returned.
makeRdsSample(N.k, b.k, sample.length)
makeRdsSample(N.k, b.k, sample.length)
N.k |
An integer vector with the population frequency of each degree. |
b.k |
A numeric vector of the sampling rates of each degree. |
sample.length |
The length of the sample. Specified as the number of recruitees before termination. |
An object of class rds-object
suitable for applying Estimate.b.k
.
The simulator does not prodice a whole RDS sample, but rather the sufficient statistics required for applying Estimate.b.k
.
[1] Berchenko, Y., Rosenblatt J.D., and S.D.W. Frost. "Modeling and Analyzing Respondent Driven Sampling as a Counting Process." arXiv:1304.3505
# Generate data: true.Nks <- rep(0,100); true.Nks[c(2,100)] <- 1000 theta <- 1e-1 true.log.bks <- rep(-Inf, 100);true.log.bks[c(2,100)] <- theta*log(c(2,100)) sample.length <- 1000L rds.simulated.object <- makeRdsSample( N.k =true.Nks , b.k = exp(true.log.bks), sample.length = sample.length) # Estimate: Estimate.b.k(rds.object = rds.simulated.object ) chords:::compareNkEstimate(rds.simulated.object$estimates$Nk.estimates, true.Nks)
# Generate data: true.Nks <- rep(0,100); true.Nks[c(2,100)] <- 1000 theta <- 1e-1 true.log.bks <- rep(-Inf, 100);true.log.bks[c(2,100)] <- theta*log(c(2,100)) sample.length <- 1000L rds.simulated.object <- makeRdsSample( N.k =true.Nks , b.k = exp(true.log.bks), sample.length = sample.length) # Estimate: Estimate.b.k(rds.object = rds.simulated.object ) chords:::compareNkEstimate(rds.simulated.object$estimates$Nk.estimates, true.Nks)
Smoothes estimated by assuming that
.
thetaSmoothingNks(rds.object, bin=1)
thetaSmoothingNks(rds.object, bin=1)
rds.object |
A |
bin |
Number of degrees to bin together for estimation. |
A numeric vector of smoothed values.
Jonathan D. Rosenblatt [email protected]
# See estimate.b.k()
# See estimate.b.k()