Digital twins of Microbial Ecosystems for Predicting Early Life Dysbiosis

Beyond

Ishanu Chattopadhyay, PhD

Assistant Professor of Biomedical Informatics & Computer Science

University of Kentucky

THE PROBLEM

Can a microbial assay from the gut actionably pre-empt developmental deficit?

Two centers: Boston U and U Chicago

Sizemore, Nicholas, Kaitlyn Oliphant, Ruolin Zheng, Camilia R. Martin, Erika C. Claud, and Ishanu Chattopadhyay. "A digital twin of the infant microbiome to predict neurodevelopmental deficits." Science Advances 10, no. 15 (2024): eadj0400.

Purely wet-lab investigations are insufficient to understand complex ecosystems

millions of years

We need to scale up!

We need a digital twin of the dataset

THE PROBLEM

Can a microbial assay from the gut actionably pre-empt developmental deficit?

Predicting neurodevelopmental deficits

Forecasting ecosystem trajectories

Build classifiers

Ability to "fill in" missing data is equivalent to making trajectory forecasts
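One way to read this equivalence, as a minimal sketch: treat the next time point as an extra column, mark it missing for the sample to be forecast, and fill it in from a conditional distribution learned on complete rows. The frequency-table predictor, the function names, and the toy abundance levels are illustrative assumptions, not the predictor used in the talk.

```python
from collections import Counter, defaultdict

def fit_conditional(rows):
    """Empirical P(next level | current level) from complete (now, next) rows."""
    counts = defaultdict(Counter)
    for now, nxt in rows:
        counts[now][nxt] += 1
    return {
        now: {lvl: c / sum(ctr.values()) for lvl, c in ctr.items()}
        for now, ctr in counts.items()
    }

def fill_in(model, row):
    """Replace a missing (None) 'next' entry by its predicted distribution."""
    now, nxt = row
    if nxt is not None:
        return row  # nothing to fill in
    return (now, model[now])

# Toy longitudinal data: abundance level of one taxon at consecutive visits.
history = [("low", "low"), ("low", "high"), ("low", "high"), ("high", "high")]
model = fit_conditional(history)

# Forecasting the next visit is the same operation as filling in a missing entry.
forecast = fill_in(model, ("low", None))
```

Here the forecast is a distribution over levels rather than a single value, which matches the idea that an estimate for a missing observation is itself a distribution.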

Forecast ecosystem fluctuations

Large Science Models

(earlier called q-nets/ quasinets)

Consider a large number (\(n\)) of coupled observables
- We don't know the potential couplings a priori

Compute all \(x_i \vert x_1,\cdots,x_{i-1},x_{i+1},\cdots, x_n\) conditionals

Estimate the joint distribution of all variables

Brook’s lemma*

*Brook, D. "On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbour systems." Biometrika 51, no. 3/4 (1964): 481-483.

perfect inference of these conditionals gives us a unique joint
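For reference, one common statement of Brook's lemma is sketched below. It assumes the positivity condition (every configuration has nonzero probability), and \(y = (y_1,\ldots,y_n)\) denotes an arbitrary fixed reference configuration, a symbol introduced here for illustration only.

```latex
\frac{P(x_1,\ldots,x_n)}{P(y_1,\ldots,y_n)}
  \;=\; \prod_{i=1}^{n}
  \frac{P\bigl(x_i \mid x_1,\ldots,x_{i-1},\, y_{i+1},\ldots,y_n\bigr)}
       {P\bigl(y_i \mid x_1,\ldots,x_{i-1},\, y_{i+1},\ldots,y_n\bigr)}
```

Every factor on the right is one of the full conditionals, so fixing \(y\) and normalizing over \(x\) recovers the joint from the conditionals alone.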

The central problem of ML is computing joint distributions or estimates thereof

hard!

*Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15, no. 3 (2006): 651-674.

LSM Forest of Conditional Inference Trees*

Revealing Emergent Cross-talk between mutations in a viral protein (Influenza A HA)

Component predictor (Conditional Inference Tree*)

Example: Influenza A HA protein

LSM Forest

H3N2 2021 Influenza A HA

- Set of conditional inference trees (CIT)
  - Strict statistical guarantees: quantifies inference uncertainty
- Each tree models exactly one variable as a function of potentially all other variables
- Non-leaf nodes are "hyperlinked" to other trees

Recursive LSM forest: hyperlinked nodes capturing emergent macro-structures
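The structure described above can be sketched in a few lines. This is a deliberately simplified stand-in: each "tree" is a depth-1 stump chosen by information gain, whereas a real conditional inference tree selects splits with permutation tests and carries significance guarantees. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def distribution(values):
    """Empirical distribution of a list of categorical values."""
    n = len(values)
    return {k: c / n for k, c in Counter(values).items()}

def fit_stump(rows, target):
    """Depth-1 stand-in for the tree that models column `target`.

    Picks the single other column with the largest information gain and
    stores the distribution of the target for each value of that column.
    """
    y = [r[target] for r in rows]
    best_gain, best_split, best_groups = 0.0, None, None
    for j in range(len(rows[0])):
        if j == target:
            continue
        groups = defaultdict(list)
        for r in rows:
            groups[r[j]].append(r[target])
        cond = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        gain = entropy(y) - cond
        if gain > best_gain:
            best_gain, best_split, best_groups = gain, j, groups
    leaves = {}
    if best_split is not None:
        leaves = {v: distribution(g) for v, g in best_groups.items()}
    return {"target": target, "split": best_split,
            "leaves": leaves, "marginal": distribution(y)}

def fit_forest(rows):
    """One predictor per variable, each fitted independently of the others."""
    return [fit_stump(rows, i) for i in range(len(rows[0]))]

def hyperlink(forest, i):
    """Follow the non-leaf node of tree i to the tree of its split variable."""
    split = forest[i]["split"]
    return None if split is None else forest[split]

# Toy data: columns 0 and 1 are perfectly coupled, column 2 is independent.
rows = [("a", "a", "x"), ("a", "a", "y"), ("b", "b", "x"), ("b", "b", "y")]
forest = fit_forest(rows)
```

In this toy forest, tree 0 splits on variable 1 and tree 1 splits on variable 0, so following hyperlinks traces the coupling between them, while tree 2 finds no informative split.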

LSM Forest

H5N1 2013 Influenza A HA


LSM Forest

Influenza C HEF

LSM Forest

GSS 2018 dataset

Large Science Models

GSS 2018 dataset

Each predictor is inferred independently
Can scale up to thousands of variables in the Python implementation
Further scale-up (\(10^6\) to \(10^8\) variables) needs a C/C++ implementation

https://34.66.189.202/data/trees2018/

Full Example of Hyperlinked Trees

Why are we talking about social modeling?

/unpopular opinion

A decade of 16S rRNA studies—pumping out "diversity indices" and entropy curves—has failed to deliver biological insight.

The elephant in the room:

we keep measuring shadows of complexity instead of modeling interactions.

We need new methods to model and understand complex systems

Develop foundation models of complex systems that
- handle hundreds to thousands of evolving variables with a priori unknown cross-talk
- learn intrinsic system geometry from data
- perform inference under data sparsity
- detect data (in)sufficiency
- support simulation and perturbation analysis

Large Science Models: Mathematical Framework

\begin{aligned}
\text{Observables:} \quad & \color{yellow} X = \{x^1, \ldots, x^N\}, \quad \overbrace{x^i \in \Sigma^i}^{\text{finite alphabet}} \\
{\color{gray}\text{Notation:}} \quad & \color{gray} x^{-i} = \{x^j : j \ne i\} \\
\text{Crosstalk:} \quad & \forall i \ P(x^i) = \color{red} f_i(x^{-i}) \\
\text{System state:} \quad & \color{Cyan} \psi = \bigotimes_{i=1}^N \psi^i, \quad \psi^i \in \mathscr{D}(\Sigma^i) \cup \varnothing \\
%\textbf{Degenerate case:} \quad & \psi^i \text{ is a delta distribution (fully observed)} \\
{\color{gray}\text{Notation:}} \quad & \color{gray} \psi^{-i} = \bigotimes_{j \ne i} \psi^j
\end{aligned}

	species A	species B	species C	---	species n
Person 1
Person 2
---
Person m

Columns: observables. Rows: samples.

Distributions over alphabet \(\Sigma^i\)

\phi = \bigotimes_{i=1}^N \phi^i, \quad \phi^i(\psi^{-i}) \in \mathscr{D}(\Sigma^i)

Individual Predictor (CIT)

cross-talk

\phi(\psi) \vert \vert \psi

Tension between predicted and observed distribution drives change

[Figure: \(\psi^i\) drawn as a distribution over the alphabet \(\Sigma^i\) = {very high, high, average, low, very low}]

Digital Twin

\phi^i(\psi^{-i}) \sim \widetilde{\psi}^i

\(\phi\) estimates \(\psi\)

Examples: GSS, ANES, WVS, ESS, Eurobarometer, Afrobarometer, Asian Barometer etc

Bacteroidota

individual

estimate is always a non-empty non-degenerate distribution

missing observation

can also be time-varying

How Complex Are These Models?

Hundreds of thousands to 10s of millions of features

The Goal: Create a digital twin which can reveal valid perturbations

Large Science Models: Properties

LSM-Distance Metric*

\theta(\psi,\psi') \triangleq \frac{1}{N}\sum_{i=1}^{N} \sqrt{D_{JS}\Bigl(\phi^i(\psi^{-i}) \vert \vert \phi^i(\psi'^{-i})\Bigr)}

where \(D_{JS}(P\vert \vert Q)\) is the Jensen-Shannon divergence.
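The distance can be computed directly once the predictors \(\phi^i\) are available. The sketch below assumes fully observed samples represented as tuples and predictors represented as plain functions from the remaining entries to a distribution (a dict from symbol to probability). The two toy predictors are hypothetical, and base-2 logarithms are an arbitrary choice.

```python
from math import log2, sqrt

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two dicts symbol -> prob."""
    symbols = set(p) | set(q)
    m = {s: 0.5 * (p.get(s, 0.0) + q.get(s, 0.0)) for s in symbols}

    def kl_to_mixture(a):
        return sum(a[s] * log2(a[s] / m[s]) for s in a if a[s] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

def lsm_distance(predictors, psi, psi_prime):
    """theta(psi, psi'): mean over i of sqrt(D_JS(phi^i(psi^-i) || phi^i(psi'^-i)))."""
    total = 0.0
    for i, phi_i in enumerate(predictors):
        rest = psi[:i] + psi[i + 1:]                    # psi^{-i}
        rest_prime = psi_prime[:i] + psi_prime[i + 1:]  # psi'^{-i}
        d = js_divergence(phi_i(rest), phi_i(rest_prime))
        total += sqrt(max(d, 0.0))  # guard against tiny negative round-off
    return total / len(predictors)

# Hypothetical predictors for a two-variable system over the alphabet {a, b}.
def phi_0(rest):
    """x^0 tends to copy x^1."""
    return {"a": 0.9, "b": 0.1} if rest[0] == "a" else {"a": 0.1, "b": 0.9}

def phi_1(rest):
    """x^1 is insensitive to x^0."""
    return {"a": 0.5, "b": 0.5}

predictors = [phi_0, phi_1]
x, y = ("a", "a"), ("a", "b")
theta_xy = lsm_distance(predictors, x, y)
```

Note that the two samples are compared through what the model predicts around them, not through their raw entries, so a change in a variable that no predictor depends on contributes nothing to the distance.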

g_{ij}(\psi) \;=\; \frac{1}{2}\,\frac{\partial^2}{\partial \psi^i\,\partial \psi^j}\,\theta^2(\psi,\psi')\Biggr|_{\psi'=\psi}

\left \lvert \ln \frac{\Pr(\psi\to \psi')}{\Pr(\psi' \rightarrow \psi')}\right \rvert \le \beta\,\theta(\psi,\psi')

Large Deviation Bound*

Induced Riemannian metric tensor

This bound connects "closeness" of samples to the odds of perturbing from one to the other, bridging geometry to dynamics

Ergodic Projection

\psi_\star \triangleq \bigotimes_{i=1}^N\phi^i\left (\prod_{1}^{N-1}\varnothing\right )
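The construction is simple to sketch: start from a state in which every entry is missing and apply each \(\phi^i\) to it. The predictor below is a hypothetical frequency-based stand-in (not a conditional inference tree) that conditions only on the observed entries of its context, with `None` marking a missing entry.

```python
from collections import Counter

def make_predictor(rows, i):
    """Stand-in for phi^i: empirical distribution of column i among rows
    that agree with every observed (non-None) entry of the context."""
    def phi_i(context):
        # `context` has N-1 entries: the state with position i removed.
        matches = [
            r for r in rows
            if all(c is None or c == v
                   for c, v in zip(context, r[:i] + r[i + 1:]))
        ]
        pool = matches or rows  # fall back to all rows if nothing matches
        counts = Counter(r[i] for r in pool)
        return {k: v / len(pool) for k, v in counts.items()}
    return phi_i

def ergodic_projection(predictors):
    """psi_star: apply every phi^i to the fully missing context."""
    empty = (None,) * (len(predictors) - 1)
    return [phi_i(empty) for phi_i in predictors]

# Toy data: two observables with levels "low" / "high".
rows = [("low", "high"), ("low", "high"), ("high", "low"), ("low", "low")]
predictors = [make_predictor(rows, i) for i in range(2)]
psi_star = ergodic_projection(predictors)
```

With this toy stand-in the projection reduces to the column marginals; with tree-based predictors it is whatever each tree returns when none of its split variables is observed.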

(Sanov's theorem, Pinsker's inequality)

\(\psi\)

\(\psi'\)

\(\theta\)

"spatial average": average of all plausible worldviews or states

* Sizemore, Nicholas, Kaitlyn Oliphant, Ruolin Zheng, Camilia R. Martin, Erika C. Claud, and Ishanu Chattopadhyay. "A digital twin of the infant microbiome to predict neurodevelopmental deficits." Science Advances 10, no. 15 (2024): eadj0400. https://www.science.org/doi/full/10.1126/sciadv.adj0400

persistence probability

Ergodic dispersion

\Psi_\star = \theta(\psi,\psi_\star)

Central to Model Drift Quantification

Start with opinion vector with all entries missing

This is a standard Physics construct, quantifying curvature of the underlying latent geometry

Pr(\psi \rightarrow \psi')

Easily computable in LSM framework!

Apply \(\phi^i\)

Random variable quantifying dispersion around the spatial average of worldviews

const. scaling as \(N^2\)

Infer LSM for typical development
Infer LSM for infants who eventually experience developmental deficit

\mathcal{H}

\mathcal{D}

\psi_0

\psi_t

Completely uninformative state

Observed state

\rho(x) =\frac{\theta_\mathcal{H}(\psi_0,x)}{\theta_\mathcal{D}(\psi_0,x)}
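As a minimal sketch, the ratio can be evaluated from any pair of distance functions, one induced by the typical-development model \(\mathcal{H}\) and one by the deficit model \(\mathcal{D}\). The two lambda distances below are hypothetical placeholders on a one-dimensional toy state, and the handling of a zero denominator is an assumption that the slide does not specify.

```python
import math

def risk(theta_h, theta_d, psi0, x):
    """rho(x) = theta_H(psi0, x) / theta_D(psi0, x)."""
    num = theta_h(psi0, x)
    den = theta_d(psi0, x)
    if den == 0.0:
        # Assumption: report nan for 0/0 and inf for a positive numerator.
        return math.nan if num == 0.0 else math.inf
    return num / den

# Hypothetical stand-in distances; real ones would come from the two LSMs.
theta_h = lambda psi0, x: 2.0 * abs(x - psi0)
theta_d = lambda psi0, x: 0.5 * abs(x - psi0)

rho = risk(theta_h, theta_d, 0.0, 1.0)
```

The same observed state \(x\) is measured from the same uninformative starting point \(\psi_0\) under both models, so the ratio compares how far \(x\) lies along each model's geometry.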

Risk

Geometric Interpretation

LSM-based Risk

How different are the individual estimators for typical and deficit models?

[Figure panels: Bacilli 30, Coriobacteria 32, Gammaproteobacteria 32; each panel compares the typical and deficit estimators]

All Patients

Feeding Variables added

Forecasting of class abundance variations, both in-sample (\(R^2 \ge 95\%\)) and out-of-sample (\(R^2 \ge 72\%\))

Building classifier based on LSM metric

But how exactly are we using the LSM metric?

Which entities are most predictive?

Interpreting Results

Just add those microbes back?

No! Our results indicate that supplantations need to be patient-specific

No transplantation is guaranteed to work reliably

Predicted to reduce risk reliably

Predicted to reduce risk reliably

Supplantation MUST be Bacteroidia

Supplantation MUST be Actinobacteria

No risk-decreasing supplantation

Network Interpretations? We see clear differences between two cases

Typical

Deficit

Integrating Clinical Features

Future

Answer the question: "what is a healthy microbiome?"

Explicit supplantation profiles that are tuned to individual ecosystems

Metabolomics

Dataset from Metabolomics Workbench

Study ID	ST000923
Study Title	Longitudinal Metabolomics of the Human Microbiome in Inflammatory Bowel Disease

Institute	Broad Institute of MIT and Harvard
Last Name	Avila-Pacheco
First Name	Julian
Submit Date	2017-11-14
Num Groups	3
Total Subjects	546
Num Males	276
Num Females	270
Analysis Type Detail	LC-MS

State-of-the-art microbiome-based classification (~10 species) *

IBD vs UC	0.82
IBD vs CD	0.76

*Zheng, J., et al. (2024). Noninvasive, microbiome-based diagnosis of inflammatory bowel disease. Nature Medicine, 30(12), 3555–3567. https://doi.org/10.1038/s41591-024-03280-4

IBD vs non IBD

0.85

Gut-Metabolome based Classification (~36 metabolites) *

LSM Forest

Non-IBD metabolomic profile*

Recursive LSM forest: hyperlinked nodes capturing emergent macro-structures

*Lloyd-Price, J., et al. "Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases." Nature 569, no. 7758 (2019): 655-662. doi: 10.1038/s41586-019-1237-9.

LSM Forest

Ulcerative colitis metabolomic profile

Recursive LSM forest: hyperlinked nodes capturing emergent macro-structures

Metabolomics

LSM model

No. parameters: 70 million
Out of sample n=150
Uses all named and unnamed metabolites (>81K features)

	AUC (out of sample)
Healthy vs IBD	96.1%
Healthy vs UC	92%
UC vs CD	100%
Healthy vs CD	100%

Metabolomics

Insight: the discriminating hypersurface is 2d (almost 1d)

LSM model for healthy profiles

What is a healthy Microbiome/ Metabolome?

\mathcal{H}

Any profile generated by \(\mathcal{H}\) is a healthy profile, even though such profiles may differ from one another

A Universal Risk Index

\theta_\mathcal{H}(\psi_\star,x)

average healthy profile

Future

How to interpret models with 81K metabolites?
- (Few large-effect-size features; very many low-effect-size features)
More data, more indications (collaborate?)

ishanu_ch@uky.edu

https://slides.com/ishanu/lsm_biome

contribute/analyze your data
New grant idea?
Write a paper?
