Grant Details
Grant Number: 1R03CA252782-01
Primary Investigator: Li, Yan
Organization: Univ Of Maryland, College Park
Project Title: Improving Population Representativeness of the Inference From Non-Probability Sample Analysis
Fiscal Year: 2020
Abstract
SUMMARY
The critical role of population-representativeness for estimating disease incidence and prevalence has been
widely accepted in epidemiologic studies. Improving population representativeness of nonprobability samples,
such as samples of volunteers in epidemiologic studies or electronic health records, however, has received little
attention from biostatisticians or epidemiologists. In this project, we propose two innovative “pseudoweight”
construction methods: 1) two-step matching, and 2) calibration, under an adapted exchangeability assumption,
for unbiased estimation of disease incidence and prevalence in the target population. The proposed methods,
combined with machine learning methods for propensity score estimation, will achieve significant bias
reduction, especially when selection into nonprobability samples is driven by complex relationships between
the covariates. We will quantify the bias reduced by the proposed “pseudoweights”, numerically and
empirically, on the estimation of disease incidence and prevalence in the target population. Monte Carlo
simulation studies are designed under varying degrees of departure from the adapted exchangeability
assumption to evaluate the bias of the proposed estimates. The robustness of the proposed estimators against
varying sample sizes, numbers of clusters in the survey, and complexities of the true propensity score model will
be investigated in scenarios that differ by levels of non-linearity, non-additivity and correlations between
covariates in the true propensity model. Using data from the National Institutes of Health and the American
Association of Retired Persons (NIH-AARP, a nonprobability cohort sample) and the US National Health
Interview Survey (NHIS, a probability survey sample), the proposed methods will be applied to estimate the
prevalence of self-reported diseases and all-cause or all-cancer mortality rates for people aged 50-71 in the
US. To test our methods, we will purposely select outcome variables that are available in both the NIH-AARP
and the NHIS. Thus, the amount of bias in NIH-AARP estimates corrected by the proposed pseudoweights
can be quantified in practice, treating the weighted NHIS estimate as the truth. The proposed methods, although
motivated by volunteer-based epidemiological studies, have wide applications outside of epidemiology,
such as analyses of electronic health records or web surveys. The results from this project can be used by epidemiologists
and health policy makers to improve the understanding of the health-related characteristics in the general
population. Computer software that implements the proposed methods will be made available for public use.
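To illustrate the general idea of a pseudoweight, the following minimal sketch uses simple post-stratification, a basic form of calibration in which cohort counts are scaled to match survey-weighted population totals within strata. It is not the two-step matching or calibration method proposed in this project, and it does not involve propensity score estimation. The function names, the age strata, and all numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def poststratified_pseudoweights(cohort_strata, survey_strata, survey_weights):
    """Return one pseudoweight per cohort unit so that the weighted cohort
    count in each stratum equals the survey-weighted population total."""
    # Estimated population total per stratum from the probability survey.
    pop_totals = defaultdict(float)
    for stratum, weight in zip(survey_strata, survey_weights):
        pop_totals[stratum] += weight

    # Unweighted number of cohort units per stratum.
    cohort_counts = defaultdict(int)
    for stratum in cohort_strata:
        cohort_counts[stratum] += 1

    # Each cohort unit stands in for (population total / cohort count) people.
    return [pop_totals[s] / cohort_counts[s] for s in cohort_strata]


def weighted_prevalence(outcomes, weights):
    """Weighted mean of a 0/1 outcome."""
    return sum(y * w for y, w in zip(outcomes, weights)) / sum(weights)


# Toy data (entirely made up): the cohort over-represents the younger stratum.
cohort_strata = ["50-59", "50-59", "50-59", "60-71"]
cohort_disease = [0, 0, 1, 1]

# Toy probability survey with sampling weights (hypothetical values).
survey_strata = ["50-59", "60-71", "60-71"]
survey_weights = [300.0, 350.0, 350.0]

pseudoweights = poststratified_pseudoweights(
    cohort_strata, survey_strata, survey_weights
)
naive = sum(cohort_disease) / len(cohort_disease)
adjusted = weighted_prevalence(cohort_disease, pseudoweights)

print("naive prevalence:", naive)
print("pseudoweighted prevalence:", adjusted)
```

In this toy setup the unweighted cohort prevalence differs from the pseudoweighted one because the older stratum, where the outcome is more common, is under-represented in the cohort relative to the survey-weighted population.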
Publications
None