Digital Twins of Microbial Ecosystems for Predicting Early Life Dysbiosis & Beyond

Ishanu Chattopadhyay, PhD

Assistant Professor of Biomedical Informatics & Computer Science

University of Kentucky

THE PROBLEM

Can a microbial assay from the gut actionably pre-empt developmental deficits?

Boston U

U Chicago 

Two centers

Sizemore, Nicholas, Kaitlyn Oliphant, Ruolin Zheng, Camilia R. Martin, Erika C. Claud, and Ishanu Chattopadhyay. "A digital twin of the infant microbiome to predict neurodevelopmental deficits." Science Advances 10, no. 15 (2024): eadj0400.

Purely wet-lab investigations are insufficient to understand complex ecosystems

millions of years

We need to scale up!

We need a digital twin of the dataset

Predicting neurodevelopmental deficits

Forecasting ecosystem trajectories

Build classifiers

Ability to "fill in" missing data is equivalent to making trajectory forecasts

Forecast ecosystem fluctuations

Large Science Models (earlier called q-nets/quasinets)

  • Consider a large number (\(n\)) of coupled observables
    • We don't know potential couplings a priori

 

  • Compute all \(x_i \vert x_1,\cdots,x_{i-1},x_{i+1},\cdots, x_n\) conditionals

 

  • Estimate the joint distribution of all variables

Brook’s lemma*

*Brook, D. (1964). On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbour systems. Biometrika, 51(3/4), 481–483.

perfect inference of these conditionals gives us a unique joint
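For reference, Brook's lemma can be written out as follows. This is the standard textbook form, not copied from the slides; it assumes every configuration has positive probability, and \(y = (y_1,\ldots,y_n)\) is an arbitrary reference configuration introduced only for the statement.

```latex
\frac{P(x_1,\ldots,x_n)}{P(y_1,\ldots,y_n)}
  \;=\; \prod_{i=1}^{n}
  \frac{P(x_i \vert x_1,\ldots,x_{i-1},\,y_{i+1},\ldots,y_n)}
       {P(y_i \vert x_1,\ldots,x_{i-1},\,y_{i+1},\ldots,y_n)}
```

Each factor is one of the full conditionals evaluated on a mixed configuration, so knowing all \(n\) conditionals fixes the joint up to the normalizing value \(P(y)\).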

The central problem of ML is computing joint distributions or estimates thereof

hard!

*Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15, no. 3 (2006): 651-674.

LSM Forest of Conditional Inference Trees*

Revealing Emergent Cross-talk between mutations in a viral protein (Influenza A HA)

Component predictor (Conditional Inference Tree*)

Example: Influenza A HA protein

LSM Forest

H3N2 2021 Influenza A HA

  • Set of conditional inference trees (CIT)
    • Strict statistical guarantees: quantifies inference uncertainty
  • Each tree models exactly one variable as a function of potentially all other variables
  • Non-leaf nodes are "hyperlinked" to other trees

Recursive LSM forest: hyperlinked nodes capturing emergent macro-structures
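The forest structure described above can be illustrated with a toy sketch. This is not the LSM implementation: real conditional inference trees select splits with permutation tests and grow to arbitrary depth, whereas the sketch below substitutes a depth-1 predictor chosen by a plain chi-square statistic. All function names (`chi2_stat`, `fit_stump`, `predict_dist`, `fit_forest`) and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def chi2_stat(xs, ys):
    """Pearson chi-square statistic between two categorical columns."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((cxy[(a, b)] - cx[a] * cy[b] / n) ** 2 / (cx[a] * cy[b] / n)
               for a in cx for b in cy)

def fit_stump(rows, target):
    """Depth-1 stand-in for one tree: models column `target` from one other column."""
    ys = [r[target] for r in rows]
    others = [j for j in range(len(rows[0])) if j != target]
    # Pick the column most strongly associated with the target.
    split_on = max(others, key=lambda j: chi2_stat([r[j] for r in rows], ys))
    leaves = defaultdict(Counter)
    for r in rows:
        leaves[r[split_on]][r[target]] += 1
    return {"target": target, "split_on": split_on,
            "leaves": dict(leaves), "marginal": Counter(ys)}

def predict_dist(stump, row):
    """Estimated distribution of the target given the other entries of `row`."""
    # Unseen or missing split values fall back to the marginal distribution.
    counts = stump["leaves"].get(row[stump["split_on"]], stump["marginal"])
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def fit_forest(rows):
    """One predictor per column; `split_on` indexes (hyperlinks to) another predictor."""
    return [fit_stump(rows, i) for i in range(len(rows[0]))]

# Toy data (made up): columns 0 and 1 always agree, column 2 is unrelated.
rows = [("a", "a", "x"), ("a", "a", "y"), ("b", "b", "x"), ("b", "b", "y")] * 5
forest = fit_forest(rows)
linked = forest[forest[0]["split_on"]]  # follow the hyperlink from tree 0
```

In this toy, the split variable of tree 0 is column 1, and following that index lands on the tree that models column 1, which is the sense in which non-leaf nodes are "hyperlinked" to other trees.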

LSM Forest

H5N1 2013 Influenza A HA

LSM Forest

Influenza C HEF

LSM Forest

GSS 2018 dataset

Large Science Models

GSS 2018 dataset

  • Each predictor is inferred independently
  • Can scale up to thousands of variables in Python implementation
  • Further scale-up \(10^6 - 10^8\) needs C/C++ implementation

Full Example  of Hyperlinked Trees

Why are we talking about social modeling?

/unpopular opinion

A decade of 16S rRNA studies—pumping out "diversity indices" and entropy curves—has failed to deliver biological insight.

 

The elephant in the room:

we keep measuring shadows of complexity instead of modeling interactions.

We need new methods to model and understand complex systems

  • Develop Foundation models of complex systems with
    • Hundreds to thousands of evolving variables with a priori unknown cross-talk
    • Learn intrinsic system geometry from data
    • Inference under data sparsity
  • Detect data (in)sufficiency
  • Support simulation and perturbation analysis

Large Science Models: Mathematical Framework

\begin{aligned}
\text{Observables:} \quad & \color{yellow}X = \{x^1, \ldots, x^N\}, \overbrace{x^i \in \Sigma^i}^{\text{finite alphabet}} \\
{\color{gray}\text{Notation:} }\quad & \color{gray} x^{-i} = \{x^j : j \ne i\}\\
\text{Crosstalk:} \quad & \forall i \ P(x^i) = \color{red} f_i(x^{-i}) \\
\text{System state:} \quad & \color{Cyan} \psi = \bigotimes_{i=1}^N \psi^i, \quad \psi^i \in \mathscr{D}(\Sigma^i) \cup \varnothing \\
%\textbf{Degenerate case:} \quad & \psi^i \text{ is a delta distribution (fully observed)} \\
{\color{gray}\text{Notation:} }\quad & \color{gray}\psi^{-i} = \bigotimes_{j \ne i} \psi^j
\end{aligned}
Data layout (rows are samples, columns are observables):

            species A   species B   species C   ...   species n
Person 1
Person 2
...
Person m

Distributions over alphabet \(\Sigma^i\)

\phi = \bigotimes_{i=1}^N \phi^i, \quad \phi^i(\psi^{-i}) \in \mathscr{D}(\Sigma^i)

Individual Predictor (CIT)

cross-talk

\phi(\psi) \vert \vert \psi

Tension between predicted and observed distribution drives change

\(\psi^i\)

Example alphabet \(\Sigma^i\): very high, high, average, low, very low

Digital Twin

\phi^i(\psi^{-i}) \sim \widetilde{\psi}^i

\(\phi\) estimates \(\psi\)
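A minimal sketch of how the predictors could stand in for missing observations, assuming a state is a list whose entries are either an observed symbol or `None` (missing), and each \(\phi^i\) is a callable returning a distribution over \(\Sigma^i\). The helper `impute` and the toy predictors built by `copy_other` are hypothetical, not part of any LSM library.

```python
import random

def impute(predictors, state, seed=0):
    """Fill missing entries (None) by sampling from each predictor's estimate."""
    rng = random.Random(seed)
    filled = list(state)
    for i, value in enumerate(state):
        if value is None:
            # phi^i is evaluated on the original state, not on earlier fills.
            dist = predictors[i](state)
            symbols = list(dist)
            filled[i] = rng.choices(symbols, weights=[dist[s] for s in symbols])[0]
    return filled

def copy_other(j):
    """Toy predictor (assumption): the variable tends to copy variable j."""
    def phi(state):
        if state[j] is None:
            return {"high": 0.5, "low": 0.5}
        other = "low" if state[j] == "high" else "high"
        return {state[j]: 0.9, other: 0.1}
    return phi

predictors = [copy_other(1), copy_other(0)]
twin = impute(predictors, ["high", None])
```

Observed entries are left untouched; only missing ones are drawn from \(\phi^i(\psi^{-i})\). If the missing entries are future time points, the same call amounts to a forecast.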

Examples: GSS, ANES, WVS, ESS, Eurobarometer, Afrobarometer, Asian Barometer etc

Bacteroidota

individual

estimate is always a non-empty non-degenerate distribution

missing observation

can also be time-varying

How Complex Are These Models?

Hundreds of thousands to tens of millions of features

The Goal: Create a digital twin which can reveal valid perturbations

Large Science Models: Properties

LSM-Distance Metric*

\theta(\psi,\psi') \triangleq \frac{1}{N}\sum_{i=1}^{N} \sqrt{D_{JS}\Bigl(\phi^i(\psi^{-i}) \vert \vert \phi^i(\psi'^{-i})\Bigr)}

 where \(D_{JS}(P\vert \vert Q)\) is the Jensen-Shannon divergence.
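The distance \(\theta\) can be computed directly from its definition. The sketch below assumes fully observed states given as lists of symbols and predictors given as callables that return `{symbol: probability}` dicts. The base-2 logarithm, the names `js_divergence`, `lsm_distance`, and the toy predictors are illustrative choices, not taken from the slides.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two {symbol: prob} dicts."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        # Kullback-Leibler divergence of `a` from the mixture `m`.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def lsm_distance(predictors, psi, psi_prime):
    """theta(psi, psi') = (1/N) * sum_i sqrt(D_JS(phi^i(psi^-i) || phi^i(psi'^-i)))."""
    total = 0.0
    for i, phi_i in enumerate(predictors):
        # Mask entry i so that phi^i only sees the other coordinates.
        p = phi_i(psi[:i] + [None] + psi[i + 1:])
        q = phi_i(psi_prime[:i] + [None] + psi_prime[i + 1:])
        total += math.sqrt(max(js_divergence(p, q), 0.0))
    return total / len(predictors)

def copy_other(j):
    """Toy predictor (assumption): the variable tends to copy variable j."""
    def phi(state):
        if state[j] is None:
            return {"high": 0.5, "low": 0.5}
        other = "low" if state[j] == "high" else "high"
        return {state[j]: 0.9, other: 0.1}
    return phi

predictors = [copy_other(1), copy_other(0)]
d = lsm_distance(predictors, ["high", "high"], ["low", "low"])
d_self = lsm_distance(predictors, ["high", "high"], ["high", "high"])
```

Note that two states are compared through what the model predicts around them, not through their raw entries, so two samples that differ in many coordinates can still be close if the predictors treat them alike.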

Induced Riemannian metric tensor

g_{ij}(\psi) \;=\; \frac{1}{2}\,\frac{\partial^2}{\partial \psi^i\,\partial \psi^j}\,\theta^2(\psi,\psi')\Biggr|_{\psi'=\psi}

Large Deviation Bound*

\left \lvert \ln \frac{\Pr(\psi\to \psi')}{\Pr(\psi' \rightarrow \psi')}\right \rvert \le \beta\,\theta(\psi,\psi')

This bound connects "closeness" of samples to the odds of perturbing from one to the other, bridging geometry to dynamics

Ergodic Projection

\psi_\star \triangleq \bigotimes_{i=1}^N\phi^i\left (\prod_{1}^{N-1}\varnothing\right )
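The ergodic projection can be sketched as: start from a state with all entries missing and apply every \(\phi^i\) once. The code below assumes predictors are callables that accept a list with `None` for missing entries; the name `ergodic_projection` and the toy predictors are hypothetical.

```python
def ergodic_projection(predictors):
    """psi_star: apply each phi^i to the state in which every entry is missing."""
    blank = [None] * len(predictors)
    return [phi_i(blank) for phi_i in predictors]

def toy_predictor(j, fallback):
    """Toy predictor (assumption): copies variable j, else returns `fallback`."""
    def phi(state):
        if state[j] is None:
            return dict(fallback)
        other = "low" if state[j] == "high" else "high"
        return {state[j]: 0.9, other: 0.1}
    return phi

predictors = [
    toy_predictor(1, {"high": 0.7, "low": 0.3}),
    toy_predictor(0, {"high": 0.4, "low": 0.6}),
]
psi_star = ergodic_projection(predictors)
```

The result is a product of distributions, one per observable, rather than a single observed sample.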

(Sanov's Theorem, Pinsker's Inequality)

\(\psi\)

\(\psi'\)

\(\theta\)

"spatial average":  average of all plausible worldviews or states

* Sizemore, Nicholas, Kaitlyn Oliphant, Ruolin Zheng, Camilia R. Martin, Erika C. Claud, and Ishanu Chattopadhyay. "A digital twin of the infant microbiome to predict neurodevelopmental deficits." Science Advances 10, no. 15 (2024): eadj0400.  https://www.science.org/doi/full/10.1126/sciadv.adj0400

persistence probability

Ergodic dispersion

\Psi_\star = \theta(\psi,\psi_\star)

Central to Model Drift Quantification

Start with opinion vector with all entries missing

This is a standard Physics construct, quantifying curvature of the underlying latent geometry

Pr(\psi \rightarrow \psi')

Easily computable in LSM framework!

Apply \(\phi^i\)

Random variable quantifying dispersion around the spatial average of worldviews

const. scaling as \(N^2\) 

  • Infer LSM for typical development: \(\mathcal{H}\)
  • Infer LSM for infants who eventually experience developmental deficit: \(\mathcal{D}\)

\(\psi_0\): completely uninformative state

\(\psi_t\): observed state

\rho(x) =\frac{\theta_\mathcal{H}(\psi_0,x)}{\theta_\mathcal{D}(\psi_0,x)}
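As a worked illustration of the ratio, the sketch below takes the two model-specific distances as callables and evaluates \(\rho\) for one sample. The function name `lsm_risk`, the `eps` guard against a zero denominator, and the stand-in distance values are all assumptions made for the example; the slides do not specify how \(\rho\) is thresholded.

```python
def lsm_risk(theta_h, theta_d, psi0, x, eps=1e-12):
    """rho(x) = theta_H(psi0, x) / theta_D(psi0, x)."""
    return theta_h(psi0, x) / max(theta_d(psi0, x), eps)

# Stand-in distances (made-up numbers), keyed by sample label.
dist_under_h = {"infant_a": 0.50, "infant_b": 0.25}
dist_under_d = {"infant_a": 0.25, "infant_b": 0.50}

theta_h = lambda psi0, x: dist_under_h[x]
theta_d = lambda psi0, x: dist_under_d[x]

psi0 = None  # placeholder for the completely uninformative state
rho_a = lsm_risk(theta_h, theta_d, psi0, "infant_a")
rho_b = lsm_risk(theta_h, theta_d, psi0, "infant_b")
```

In practice `theta_h` and `theta_d` would be the LSM distance evaluated with the predictors of \(\mathcal{H}\) and \(\mathcal{D}\) respectively.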

Risk

Geometric Interpretation

LSM-based Risk

How different are the individual estimators for typical and deficit models?

Bacilli 30

typical 

deficit

Coriobacteria 32

typical 

deficit

Gammaproteobacteria 32

typical 

deficit

All Patients

Feeding Variables added

Forecasting of class abundance variations both in-sample (\(R^2 \ge 95\%\)) and out-of-sample (\(R^2 \ge 72\%\))

Building classifier based on LSM metric

But how exactly are we using the LSM metric?

Which entities are most predictive?

Interpreting Results

Just add those microbes back?

No! Our results indicate that supplantations need to be patient-specific

No transplantation is guaranteed to work reliably

Predicted to reduce risk reliably

Supplantation MUST be Bacteroidia

Supplantation MUST be Actinobacteria

No risk-decreasing supplantation

Network Interpretations? We see clear differences between two cases

Typical

Deficit

Integrating Clinical Features

Future

Answer the question: "what is a healthy microbiome?"

 

Explicit supplantation profiles that are tuned to individual ecosystems

Metabolomics

Dataset from Metabolomics Workbench

Study ID ST000923
Study Title Longitudinal Metabolomics of the Human Microbiome in Inflammatory Bowel Disease
Institute Broad Institute of MIT and Harvard
Last Name Avila-Pacheco
First Name Julian
Submit Date 2017-11-14
Num Groups 3
Total Subjects 546
Num Males 276
Num Females 270
Analysis Type Detail LC-MS

State-of-the-art microbiome-based classification (~10 species)*

IBD vs UC 0.82
IBD vs CD 0.76

*Zheng, J., et al. (2024). Noninvasive, microbiome-based diagnosis of inflammatory bowel disease. Nature Medicine, 30(12), 3555–3567. https://doi.org/10.1038/s41591-024-03280-4

IBD vs non IBD 0.85

Gut metabolome-based classification (~36 metabolites)*

LSM Forest

Non-IBD metabolomic profile*

Recursive LSM forest: hyperlinked nodes capturing emergent macro-structures

*Lloyd-Price J, et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 2019 May;569(7758):655-662. doi: 10.1038/s41586-019-1237-9. Epub 2019 May 29. PMID: 31142855; PMCID: PMC6650278.

LSM Forest

Ulcerative Colitis metabolomic profile

Recursive LSM forest: hyperlinked nodes capturing emergent macro-structures

Metabolomics

LSM model

  • No. parameters: 70 million
  • Out of sample n=150
  • Uses all named and unnamed metabolites (>81K features)
AUC (out of sample)
Healthy vs IBD 96.1%
Healthy vs UC 92%
UC vs CD 100%
Healthy vs CD 100%

Metabolomics

Insight: the discriminating hypersurface is 2D (almost 1D)

LSM model for healthy profiles

What is a healthy Microbiome/ Metabolome?

\mathcal{H}

Any profile generated by \(\mathcal{H}\) is a healthy profile, even though such profiles may differ from one another

A Universal Risk Index

\theta_\mathcal{H}(\psi_\star,x)

average healthy profile

Future

  • How to interpret models with 81K metabolites?

    • (Few large effect-size features, very many low effect-size features)
  • More data, more indications (collaborate?)

ishanu_ch@uky.edu

https://slides.com/ishanu/lsm_biome

  • contribute/analyze your data
  • New grant idea?
  • Write a paper?
