How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences
NOTE
Authors: Sofiane Ouaari*, Jules Kreuer*, Nico Pfeifer
*: Shared first authorship.
DOI: 10.48550/arXiv.2603.06950
/not-a-feature/DNA_Embedding_Inversion
Abstract
DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Trained on vast genomic datasets, these models can be used to generate sequence embeddings, dense vector representations that capture complex genomic information. These embeddings are increasingly being shared via Embeddings-as-a-Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice becomes more prevalent, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, whereby adversaries attempt to reconstruct sensitive training data from model outputs. In our setting, the model output used for reconstruction is a zero-shot embedding, which is fed to a decoder that predicts the DNA sequence. We evaluated the privacy of three DNA foundation models: DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2). Our results show that per-token embeddings allow near-perfect sequence reconstruction across all models. For mean-pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove to be the most vulnerable, especially for shorter sequences, where reconstruction similarities exceed 90%, while DNABERT-2’s BPE tokenization provides the greatest resilience. We found that the correlation between embedding similarity and sequence similarity was a key predictor of reconstruction success. Our findings emphasize the urgent need for privacy-aware design in genomic foundation models prior to their widespread deployment in EaaS settings.
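The abstract distinguishes per-token embeddings from mean-pooled embeddings. The following is a minimal sketch of what mean pooling does to a per-token embedding matrix, not the code used in the paper. The function name `mean_pool`, the toy 2-dimensional vectors, and the padding mask are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              mask: Sequence[int]) -> List[float]:
    """Average the embedding vectors of all unmasked tokens.

    token_embeddings: one vector per token (shape: tokens x dim).
    mask: 1 for real tokens, 0 for padding (hypothetical convention).
    """
    if len(token_embeddings) != len(mask):
        raise ValueError("token_embeddings and mask must have the same length")
    kept = [vec for vec, m in zip(token_embeddings, mask) if m]
    if not kept:
        raise ValueError("mask removes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy per-token embeddings: three tokens, the last one is padding.
per_token = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(per_token, mask)

# Averaging ignores the order of the token vectors, so swapping the two
# real tokens yields the same pooled vector.
swapped = mean_pool([per_token[1], per_token[0], per_token[2]], mask)

print(pooled)   # [0.5, 0.5]
print(swapped)  # [0.5, 0.5]
```

Pooling collapses a variable number of token vectors into a single fixed-size vector, so the decoder has less information per nucleotide as sequences get longer. This fits the reported trend that mean-pooled reconstruction quality degrades with sequence length, although the toy example makes no claim about the size of that effect.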
Fig. 3. Mean-pooled reconstruction performance across sequence lengths for the encoder-only architecture: (a) Levenshtein similarity and (b) nucleotide accuracy.
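The figure reports two reconstruction metrics, Levenshtein similarity and nucleotide accuracy. The sketch below shows one common way to compute them and is not the paper's evaluation code. It assumes that Levenshtein similarity is one minus the edit distance divided by the longer sequence length, and that nucleotide accuracy is the fraction of target positions matched exactly; the paper's normalization may differ. All function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def nucleotide_accuracy(target: str, predicted: str) -> float:
    """Assumed definition: fraction of target positions matched exactly."""
    if not target:
        raise ValueError("target must not be empty")
    matches = sum(t == p for t, p in zip(target, predicted))
    return matches / len(target)


target = "ACGTACGT"
substituted = "ACGAACGT"  # one wrong base, alignment preserved
shifted = "CGTACGT"       # first base dropped, everything shifts left

print(levenshtein_similarity(target, substituted))  # 0.875
print(nucleotide_accuracy(target, substituted))     # 0.875
print(levenshtein_similarity(target, shifted))      # 0.875
print(nucleotide_accuracy(target, shifted))         # 0.0
```

The two metrics agree for a single substitution but diverge for a single deletion: the edit distance is still 1, while position-wise accuracy drops to zero because every remaining base is out of register. Reporting both, as the figure does, separates alignment errors from base-calling errors.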
Preprint & Model Weights
The full preprint can be found here: arxiv.org/abs/2603.06950
Model weights and raw output can be found here: huggingface.co/not-a-feature/DNA_Embedding_Inversion
License
This project is licensed under the LGPL-2.1 License.
Acknowledgments
This work was supported by the Carl Zeiss Stiftung Research Project “Certification and Foundations of Safe Machine Learning Systems in Healthcare”, by the German Research Foundation (DFG) under Germany’s Excellence Strategy—EXC number 2064/1—Project number 390727645, and by the German Federal Ministry of Research, Technology and Space (BMFTR) within the PrivateAIM project (funding number: 01ZZ2316D). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS).