Grant Details

Grant Number: 1R03CA252782-01
Primary Investigator: Li, Yan
Organization: University of Maryland, College Park
Project Title: Improving Population Representativeness of the Inference From Non-Probability Sample Analysis
Fiscal Year: 2020


SUMMARY The critical role of population representativeness in estimating disease incidence and prevalence is widely accepted in epidemiologic studies. Improving the population representativeness of nonprobability samples, such as volunteer samples in epidemiologic studies or electronic health records, however, has received little attention from biostatisticians or epidemiologists. In this project, we propose two innovative “pseudoweight” construction methods, 1) two-step matching and 2) calibration, under an adapted exchangeability assumption, for unbiased estimation of disease incidence and prevalence in the target population. The proposed methods, combined with machine learning methods for propensity score estimation, will achieve significant bias reduction, especially when selection into nonprobability samples is driven by complex relationships among the covariates. We will quantify, numerically and empirically, the bias reduction achieved by the proposed pseudoweights in estimating disease incidence and prevalence in the target population. Monte Carlo simulation studies are designed under varying degrees of departure from the adapted exchangeability assumption to evaluate the bias of the proposed estimates. The robustness of the proposed estimators to varying sample sizes, numbers of survey clusters, and complexity of the true propensity score model will be investigated in scenarios that differ in the levels of non-linearity, non-additivity, and correlation among covariates in the true propensity model. Using data from the National Institutes of Health and the American Association of Retired Persons (NIH-AARP, a nonprobability cohort sample) and from the US National Health Interview Survey (NHIS, a probability survey sample), the proposed methods will be applied to estimate the prevalence of self-reported diseases and all-cause or all-cancer mortality rates for people aged 50-71 in the US.
To test our methods, we will purposely select outcome variables that are available in both the NIH-AARP and the NHIS. The amount of bias in the NIH-AARP estimates that the proposed pseudoweights correct can then be quantified in practice, treating the weighted NHIS estimate as the truth. The proposed methods, although motivated by volunteer-based epidemiologic studies, have wide applications outside of epidemiology, such as electronic health records or web surveys. The results from this project can be used by epidemiologists and health policy makers to improve understanding of health-related characteristics in the general population. Computer software that implements the proposed methods will be made available for public use.
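
To make the idea of a pseudoweight concrete, the sketch below shows simple post-stratification, the most basic form of calibration. It is not the two-step matching or calibration method proposed in this project. It only illustrates how weights for a nonprobability cohort can be scaled so that the weighted cohort reproduces population totals estimated from a weighted probability survey. The function names, the age cells, and the toy data are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def poststratified_pseudoweights(cohort_cells, survey_cells, survey_weights):
    """Return one pseudoweight per cohort unit (simple post-stratification).

    Illustrative assumption: within each cell, cohort members are
    exchangeable with the population members of that cell.
    """
    # Estimated population total per cell from the weighted probability survey.
    pop_totals = defaultdict(float)
    for cell, weight in zip(survey_cells, survey_weights):
        pop_totals[cell] += weight

    # Unweighted number of cohort members per cell.
    cohort_counts = defaultdict(int)
    for cell in cohort_cells:
        cohort_counts[cell] += 1

    # Each cohort member represents (population total / cohort count) people.
    return [pop_totals[cell] / cohort_counts[cell] for cell in cohort_cells]


def weighted_mean(values, weights):
    """Weighted mean of values, e.g. a pseudoweighted prevalence estimate."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data: the cohort over-represents the younger age cell.
cohort_cells = ["50-59", "50-59", "50-59", "60-71"]
cohort_outcome = [0, 0, 1, 1]  # 1 = reports the disease

# Hypothetical probability survey with its sampling weights.
survey_cells = ["50-59", "60-71", "60-71"]
survey_weights = [100.0, 150.0, 150.0]

pseudoweights = poststratified_pseudoweights(
    cohort_cells, survey_cells, survey_weights
)
naive = sum(cohort_outcome) / len(cohort_outcome)
adjusted = weighted_mean(cohort_outcome, pseudoweights)

print("naive prevalence:", naive)
print("pseudoweighted prevalence:", round(adjusted, 4))
```

In this toy example the naive cohort prevalence is 0.5, while the pseudoweighted estimate is about 0.83 because the older cell, which is under-represented in the cohort, is weighted up to its estimated population share. A survey cell with no cohort members cannot be represented this way, which is one reason model-based approaches such as propensity scores are of interest.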


