Grant Details
Grant Number: 5U01CA274576-02
Primary Investigator: Long, Qi
Organization: University of Pennsylvania
Project Title: Robust Privacy Preserving Distributed Analysis Platform for Cancer Research: Addressing Data Bias and Disparities
Fiscal Year: 2024
Abstract
Project Summary
Privacy-preserving distributed analysis has gained increasing interest in the broad biomedical research
community in recent years, as it can a) eliminate the need to create, maintain, and secure access to central
data repositories, b) minimize the need to disclose protected health information outside the data-owning entity,
and c) mitigate many security, proprietary, privacy and other concerns. As such, it offers great promise in
lowering regulatory and other hurdles for collaboration across multiple institutions and enhancing the public
trust in biomedical research. Equally important, analysis of health data from multiple institutions across the US
would yield more robust and generalizable findings. This is particularly relevant in cancer disparities research
as the sample size for minority groups can be very small from one institution. However, there remain significant
methodological gaps in the current state-of-the-art for privacy-preserving distributed analysis. Most notably,
missing data present significant challenges, as they are ubiquitous in biomedical data including, but not limited
to, electronic health records (EHR). It is well known that missing data are a major source of bias in EHR. For
example, patients from minority groups and those who have less access to private insurance tend to have
more missing data in their EHR. Biased data as a result of missing data are known to yield unfair statistical and
machine learning models, which in turn can perpetuate and exacerbate health inequities and disparities. There
has been no work on principled approaches for properly handling missing data in distributed analysis beyond
our recent work. In addition, it is well known that distributed analysis is still at risk of revealing important
individual-level information and lacks a rigorous guarantee in the sense of differential privacy, the prevailing
notion and metric for privacy protection. To address these significant limitations, we propose three specific
aims. In Aim 1, we will refine and develop state-of-the-art imputation methods for handling missing data in
distributed analysis and develop advanced functionalities for enhanced privacy protection through differential
privacy control and homomorphic encryption. Building on the methods developed in Aim 1, we will develop an
open-source and open-access distributed analysis platform that includes a robust system architecture and
user-friendly GUI in Aim 2. We will assess and validate our distributed analysis platform using real-world use
cases in cancer disparities research in Aim 3. With the enhanced privacy protection, our proposed distributed
analysis platform will have the potential to further enhance public trust and lower hurdles for collaboration
across multiple institutions in cancer research. As such, our platform will enable researchers to use more
information and less biased data in cancer research, enhance the validity, robustness and generalizability of
research findings, and offer substantial benefits in areas including, but not limited to, cancer disparities
research and informatics practice.
Publications
Evaluating generalizability of oncology trial results to real-world patients using machine learning-based trial emulations.
Authors: Orcutt X., Chen K., Mamtani R., Long Q., Parikh R.B.
Source: Nature Medicine, 2025-01-03.
EPub date: 2025-01-03.
PMID: 39753967
De-identification is not enough: a comparison between de-identified and synthetic clinical notes.
Authors: Sarkar A.R., Chuang Y.S., Mohammed N., Jiang X.
Source: Scientific Reports, 2024-11-29; 14(1), p. 29669.
EPub date: 2024-11-29.
PMID: 39613846
Deep learning-based approaches for multi-omics data integration and analysis.
Authors: Ballard J.L., Wang Z., Li W., Shen L., Long Q.
Source: BioData Mining, 2024-10-02; 17(1), p. 38.
EPub date: 2024-10-02.
PMID: 39358793
DISCRET: Synthesizing Faithful Explanations For Treatment Effect Estimation.
Authors: Wu Y., Keoliya M., Chen K., Velingker N., Li Z., Getzen E.J., Long Q., Naik M., Parikh R.B., Wong E.
Source: Proceedings of Machine Learning Research, 2024 Jul; 235, p. 53597-53618.
PMID: 39205826
Filling the gaps: leveraging large language models for temporal harmonization of clinical text across multiple medical visits for clinical prediction.
Authors: Choi I., Long Q., Getzen E.
Source: medRxiv : The Preprint Server for Health Sciences, 2024-05-07.
EPub date: 2024-05-07.
PMID: 38765975
SADI: Similarity-Aware Diffusion Model-Based Imputation for Incomplete Temporal EHR Data.
Authors: Dai Z., Getzen E., Long Q.
Source: Proceedings of Machine Learning Research, 2024 May; 238, p. 4195-4203.
PMID: 39267895
SAFER: sub-hypergraph attention-based neural network for predicting effective responses to dose combinations.
Authors: Tang Y.C., Li R., Tang J., Zheng W.J., Jiang X.
Source: Research Square, 2024-04-30.
EPub date: 2024-04-30.
PMID: 38746131
Deep learning to predict rapid progression of Alzheimer's disease from pooled clinical trials: A retrospective study.
Authors: Ma X., Shyer M., Harris K., Wang D., Hsu Y.C., Farrell C., Goodwin N., Anjum S., Bukhbinder A.S., Dean S., et al.
Source: PLOS Digital Health, 2024 Apr; 3(4), p. e0000479.
EPub date: 2024-04-10.
PMID: 38598464
SPeC: A Soft Prompt-Based Calibration on Performance Variability of Large Language Model in Clinical Notes Summarization.
Authors: Chuang Y.N., Tang R., Jiang X., Hu X.
Source: Journal of Biomedical Informatics, 2024 Mar; 151, p. 104606.
EPub date: 2024-02-05.
PMID: 38325698
Disentangling Accelerated Cognitive Decline from the Normal Aging Process and Unraveling Its Genetic Components: A Neuroimaging-Based Deep Learning Approach.
Authors: Dai Y., Hsu Y.C., Fernandes B.S., Zhang K., Li X., Enduru N., Liu A., Manuel A.M., Jiang X., Zhao Z., et al.
Source: Journal of Alzheimer's Disease : JAD, 2024; 97(4), p. 1807-1827.
PMID: 38306043
FERI: A Multitask-based Fairness Achieving Algorithm with Applications to Fair Organ Transplantation.
Authors: Li C., Lai D., Jiang X., Zhang K.
Source: AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science, 2024; 2024, p. 593-602.
EPub date: 2024-05-31.
PMID: 38827050
PFERM: A Fair Empirical Risk Minimization Approach with Prior Knowledge.
Authors: Hou B., Mondragón A., Tarzanagh D.A., Zhou Z., Saykin A.J., Moore J.H., Ritchie M.D., Long Q., Shen L.
Source: AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science, 2024; 2024, p. 211-220.
EPub date: 2024-05-31.
PMID: 38827072
Bridging the Gap: Rademacher Complexity in Robust and Standard Generalization.
Authors: Xiao J., Sun R., Long Q., Su W.J.
Source: Proceedings of Machine Learning Research, 2024; 247, p. 5074-5075.
PMID: 39206101
Fair Canonical Correlation Analysis.
Authors: Zhou Z., Tarzanagh D.A., Hou B., Tong B., Xu J., Feng Y., Long Q., Shen L.
Source: Advances in Neural Information Processing Systems, 2023 Dec; 36, p. 3675-3705.
PMID: 38665178
Predicting multiple sclerosis severity with multimodal deep neural networks.
Authors: Zhang K., Lincoln J.A., Jiang X., Bernstam E.V., Shams S.
Source: BMC Medical Informatics and Decision Making, 2023-11-09; 23(1), p. 255.
EPub date: 2023-11-09.
PMID: 37946182
Characterizing Treatment Non-responders vs. Responders in Completed Alzheimer's Disease Clinical Trials.
Authors: Wang D., Ling Y., Harris K., Schulz P.E., Jiang X., Kim Y.
Source: medRxiv : The Preprint Server for Health Sciences, 2023-10-30.
EPub date: 2023-10-30.
PMID: 37961216
DiscoverPath: A Knowledge Refinement and Retrieval System for Interdisciplinarity on Biomedical Research.
Authors: Chuang Y.N., Wang G., Chang C.Y., Lai K.H., Zha D., Tang R., Yang F., Reyes A.C., Zhou K., Jiang X., et al.
Source: Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management, 2023 Oct; 2023, p. 5021-5025.
EPub date: 2023-10-21.
PMID: 38832084
Disentangling accelerated cognitive decline from the normal aging process and unraveling its genetic components: A neuroimaging-based deep learning approach.
Authors: Dai Y., Hsu Y.C., Fernandes B.S., Zhang K., Li X., Enduru N., Liu A., Manuel A.M., Jiang X., Zhao Z.
Source: Research Square, 2023-09-08.
EPub date: 2023-09-08.
PMID: 37720047
Fairness-Aware Class Imbalanced Learning on Multiple Subgroups.
Authors: Tarzanagh D.A., Hou B., Tong B., Long Q., Shen L.
Source: Proceedings of Machine Learning Research, 2023 Aug; 216, p. 2123-2133.
PMID: 38601022
Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching.
Authors: Yuan J., Tang R., Jiang X., Hu X.
Source: AMIA ... Annual Symposium Proceedings. AMIA Symposium, 2023; 2023, p. 1324-1333.
EPub date: 2024-01-11.
PMID: 38222339
Sensitive Data Detection with High-Throughput Machine Learning Models in Electrical Health Records.
Authors: Zhang K., Jiang X.
Source: AMIA ... Annual Symposium Proceedings. AMIA Symposium, 2023; 2023, p. 814-823.
EPub date: 2024-01-11.
PMID: 38222389
Leveraging informative missing data to learn about acute respiratory distress syndrome and mortality in long-term hospitalized COVID-19 patients throughout the years of the pandemic.
Authors: Getzen E., Tan A.L., Brat G., Omenn G.S., Strasser Z., Consortium for Clinical Characterization of COVID-19 by EHR (4CE) (Collaborative Group/Consortium), Long Q., Holmes J.H., Mowery D.
Source: AMIA ... Annual Symposium Proceedings. AMIA Symposium, 2023; 2023, p. 942-950.
EPub date: 2024-01-11.
PMID: 38222425
Minimax Estimation for Personalized Federated Learning: An Alternative between FedAvg and Local Training?
Authors: Chen S., Zheng Q., Long Q., Su W.J.
Source: Journal of Machine Learning Research : JMLR, 2023; 24.
PMID: 39105110