Grant Details

Grant Number: 1U01CA274576-01A1
Primary Investigator: Long, Qi
Organization: University Of Pennsylvania
Project Title: Robust Privacy Preserving Distributed Analysis Platform for Cancer Research: Addressing Data Bias and Disparities
Fiscal Year: 2023


Project Summary: Privacy-preserving distributed analysis has gained increasing interest in the broad biomedical research community in recent years, as it can a) eliminate the need to create, maintain, and secure access to central data repositories, b) minimize the need to disclose protected health information outside the data-owning entity, and c) mitigate many security, proprietary, privacy, and other concerns. As such, it offers great promise for lowering regulatory and other hurdles to collaboration across multiple institutions and for enhancing public trust in biomedical research. Equally important, analysis of health data from multiple institutions across the US would yield more robust and generalizable findings. This is particularly relevant in cancer disparities research, as the sample size for minority groups at any one institution can be very small. However, there remain significant methodological gaps in the current state of the art for privacy-preserving distributed analysis. Most notably, missing data present significant challenges, as they are ubiquitous in biomedical data including, but not limited to, electronic health records (EHR). It is well known that missing data are a major source of bias in EHR. For example, patients from minority groups and those who have less access to private insurance tend to have more missing data in their EHR. Biased data resulting from missing data are known to yield unfair statistical and machine learning models, which in turn can perpetuate and exacerbate health inequities and disparities. There has been no work on principled approaches for properly handling missing data in distributed analysis beyond our recent work. In addition, it is well known that distributed analysis is still at risk of revealing important individual-level information and lacks a rigorous guarantee in the sense of differential privacy, the prevailing notion and metric for privacy protection.
To address these significant limitations, we propose three specific aims. In Aim 1, we will refine and develop state-of-the-art imputation methods for handling missing data in distributed analysis and develop advanced functionalities for enhanced privacy protection through differential privacy control and homomorphic encryption. In Aim 2, building on the methods developed in Aim 1, we will develop an open-source and open-access distributed analysis platform that includes a robust system architecture and a user-friendly GUI. In Aim 3, we will assess and validate our distributed analysis platform using real-world use cases in cancer disparities research. With the enhanced privacy protection, our proposed distributed analysis platform will have the potential to further enhance public trust and lower hurdles for collaboration across multiple institutions in cancer research. As such, our platform will enable researchers to use more information and less biased data in cancer research, enhance the validity, robustness, and generalizability of research findings, and offer substantial benefits to research in areas including, but not limited to, cancer disparities and informatics practice.


Deep learning to predict rapid progression of Alzheimer's disease from pooled clinical trials: A retrospective study.
Authors: Ma X., Shyer M., Harris K., Wang D., Hsu Y.C., Farrell C., Goodwin N., Anjum S., Bukhbinder A.S., Dean S., et al.
Source: PLOS digital health, 2024 Apr; 3(4), p. e0000479.
EPub date: 2024-04-10.
PMID: 38598464

SPeC: A Soft Prompt-Based Calibration on Performance Variability of Large Language Model in Clinical Notes Summarization.
Authors: Chuang Y.N., Tang R., Jiang X., Hu X.
Source: Journal of biomedical informatics, 2024 Mar; 151, p. 104606.
EPub date: 2024-02-05.
PMID: 38325698

Fair Canonical Correlation Analysis.
Authors: Zhou Z., Tarzanagh D.A., Hou B., Tong B., Xu J., Feng Y., Long Q., Shen L.
Source: Advances in neural information processing systems, 2023 Dec; 36, p. 3675-3705.
PMID: 38665178

Predicting multiple sclerosis severity with multimodal deep neural networks.
Authors: Zhang K., Lincoln J.A., Jiang X., Bernstam E.V., Shams S.
Source: BMC medical informatics and decision making, 2023-11-09; 23(1), p. 255.
EPub date: 2023-11-09.
PMID: 37946182

Characterizing Treatment Non-responders vs. Responders in Completed Alzheimer's Disease Clinical Trials.
Authors: Wang D., Ling Y., Harris K., Schulz P.E., Jiang X., Kim Y.
Source: medRxiv: the preprint server for health sciences, 2023-10-30.
EPub date: 2023-10-30.
PMID: 37961216

Disentangling accelerated cognitive decline from the normal aging process and unraveling its genetic components: A neuroimaging-based deep learning approach.
Authors: Dai Y., Yu-Chun H., Fernandes B.S., Zhang K., Xiaoyang L., Enduru N., Liu A., Manuel A.M., Jiang X., Zhao Z.
Source: Research Square, 2023-09-08.
EPub date: 2023-09-08.
PMID: 37720047

Fairness-Aware Class Imbalanced Learning on Multiple Subgroups.
Authors: Tarzanagh D.A., Hou B., Tong B., Long Q., Shen L.
Source: Proceedings of machine learning research, 2023 Aug; 216, p. 2123-2133.
PMID: 38601022

Leveraging informative missing data to learn about acute respiratory distress syndrome and mortality in long-term hospitalized COVID-19 patients throughout the years of the pandemic.
Authors: Getzen E., Tan A.L., Brat G., Omenn G.S., Strasser Z., Consortium for Clinical Characterization of COVID-19 by EHR (4CE) (Collaborative Group/Consortium), Long Q., Holmes J.H., Mowery D.
Source: AMIA ... Annual Symposium proceedings. AMIA Symposium, 2023; 2023, p. 942-950.
EPub date: 2024-01-11.
PMID: 38222425

Sensitive Data Detection with High-Throughput Machine Learning Models in Electrical Health Records.
Authors: Zhang K., Jiang X.
Source: AMIA ... Annual Symposium proceedings. AMIA Symposium, 2023; 2023, p. 814-823.
EPub date: 2024-01-11.
PMID: 38222389

Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching.
Authors: Yuan J., Tang R., Jiang X., Hu X.
Source: AMIA ... Annual Symposium proceedings. AMIA Symposium, 2023; 2023, p. 1324-1333.
EPub date: 2024-01-11.
PMID: 38222339
