Original Research · Cardiology · AI

ECG is All You Need

Democratizing Cardiac Imaging with Large Language Models — A Feasibility Study of Cross-Modal Cardiac Inference
Azhar Tadas Wahab, Anirudh Gangadharan, Mohammad Bilal, Sahil Agarwala, Anam Fathima, Madiha Fathima
Department of Cardiology & Medical AI Research · 2025
311 paired cases · best r (EF) = 0.394 · 64% within ±5% EF · 6.4% MAE (best model)

Over two billion people lack access to echocardiography — the primary tool for assessing cardiac structure and function. In resource-constrained settings across South Asia, Sub-Saharan Africa, and rural communities worldwide, patients with suspected heart failure, valvular disease, or post-infarction complications face diagnostic delays measured in weeks or months, not minutes.

The 12-lead electrocardiogram, by contrast, costs under $2 per test and is available in virtually every clinic. But ECG is conventionally understood as an electrophysiological tool — it measures electrical activity, not structural anatomy. This study asks: can modern AI extract latent structural information from ECG waveforms, effectively inferring what an echocardiogram would show?

We conducted a retrospective comparative analysis of 311 consecutive paired ECG–echocardiogram cases from a single cardiac centre. Three leading large language models — Grok 4, Claude Opus 4, and ChatGPT-4o — were given 12-lead ECG images and asked to generate complete echocardiographic reports through structured prompt engineering, with zero-shot inference and no fine-tuning.
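A hedged sketch of how such a structured zero-shot request might be assembled. The study's actual prompt wording and API are not reproduced here, so the field list, the `build_request` helper, and the message schema below are all illustrative assumptions, not the method as published:

```python
# Hypothetical structured prompt; every field below is illustrative, not the
# study's published wording.
ECHO_REPORT_PROMPT = """You are a cardiologist. From the attached 12-lead ECG image,
generate a complete echocardiographic report covering:
- Left ventricular ejection fraction (EF, %)
- Diastolic dysfunction grade (normal / I / II / III)
- Valve regurgitation severity (none / mild / moderate / severe) per valve
- Regional wall motion abnormalities (present / absent, with segments)
State each field explicitly; write 'indeterminate' if it cannot be inferred."""

def build_request(ecg_image_b64: str) -> dict:
    """Package the prompt plus ECG image for a generic multimodal chat API."""
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": ECHO_REPORT_PROMPT},
                {"type": "image", "data": ecg_image_b64},
            ],
        }]
    }
```

The same request template would be sent unchanged to each of the three models, keeping the zero-shot comparison fair.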

Pipeline: 12-lead ECG image → LLM (zero-shot) → synthetic ECHO report → statistical comparison against cardiologist gold standard.

LLM-generated reports were compared against cardiologist-authored echocardiograms across ejection fraction (EF), diastolic dysfunction grading, valve regurgitation severity, and regional wall motion abnormalities (RWMA). Statistical analysis employed Pearson correlation, ICC, Bland-Altman analysis, weighted Cohen's Kappa, and Friedman test with post-hoc comparisons.
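To make the evaluation concrete, here is a minimal sketch of several of these metrics computed on synthetic data (the real paired EF values are not reproduced here, and ICC and the post-hoc comparisons are omitted for brevity):

```python
import numpy as np
from scipy.stats import pearsonr, friedmanchisquare
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
truth = rng.uniform(25, 70, size=50)          # synthetic cardiologist EF (%)
model = truth + rng.normal(0, 10, size=50)    # synthetic LLM EF predictions

r, p = pearsonr(model, truth)                 # linear agreement
mae = np.mean(np.abs(model - truth))          # mean absolute error
within5 = np.mean(np.abs(model - truth) <= 5) # fraction within ±5% EF

# Weighted Cohen's kappa on ordinal diastolic-dysfunction grades (0-3)
grades_true = rng.integers(0, 4, size=50)
grades_pred = np.clip(grades_true + rng.integers(-1, 2, size=50), 0, 3)
kappa = cohen_kappa_score(grades_true, grades_pred, weights="linear")

# Friedman test comparing three models' absolute EF errors case by case
err_a = np.abs(model - truth)
err_b = np.abs(model + rng.normal(0, 3, 50) - truth)
err_c = np.abs(model + rng.normal(0, 6, 50) - truth)
chi2, p_friedman = friedmanchisquare(err_a, err_b, err_c)
```

The Friedman test is the non-parametric analogue of repeated-measures ANOVA, appropriate here because the same 311 cases are scored by all three models.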

Claude Opus 4 achieved the highest concordance with cardiologist assessments across all primary metrics — Pearson r = 0.394 (p < 0.001), ICC = 0.368, with 63.6% of EF predictions falling within ±5% of true values. This represents a statistically significant advantage over both Grok 4 and ChatGPT (Friedman χ² = 36.3, p < 0.001).

Figure 1. Ejection fraction correlation between LLM predictions and cardiologist assessments. Claude Opus 4 (centre) shows the strongest linear relationship (r = 0.394, p < 0.001, n = 272). Dashed line indicates perfect agreement.
Figure 2. Bland-Altman analysis for Claude Opus 4. Mean bias of +0.3% with limits of agreement −18.6% to +19.3%, indicating minimal systematic error.
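The bias and limits of agreement in a Bland-Altman analysis follow a standard definition: mean of the paired differences, ± 1.96 × their standard deviation. A minimal sketch with made-up EF pairs (not the study's data):

```python
import numpy as np

def bland_altman(pred, truth):
    """Return (mean bias, lower LoA, upper LoA) for paired measurements.

    LoA = bias ± 1.96 × SD of the pairwise differences (95% limits).
    """
    diff = np.asarray(pred, dtype=float) - np.asarray(truth, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Illustrative EF pairs only (LLM prediction vs cardiologist, in %)
bias, lo, hi = bland_altman([55, 40, 62, 35, 50], [53, 45, 60, 38, 49])
```

A near-zero bias with wide limits of agreement, as in Figure 2, means the model is unbiased on average but individual predictions can still miss substantially.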
Figure 3. Comprehensive performance across EF correlation, RWMA detection, clinical agreement distribution, and documentation completeness.
Table 1. Summary performance metrics. Claude Opus 4 leads across EF correlation, MAE, clinical agreement, and RWMA sensitivity. *** p < 0.001.

For diastolic dysfunction grading, Claude Opus 4 achieved fair agreement (weighted κ = 0.324), outperforming both Grok 4 (κ = 0.172) and ChatGPT (κ = 0.135). RWMA detection showed a conservative prediction pattern across all models — high specificity (82.8–91.9%) but limited sensitivity (16.7–46.1%), suggesting the models prioritise avoiding false positives.
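The RWMA sensitivity and specificity figures follow the usual confusion-matrix definitions. A minimal sketch with illustrative counts (the study's per-model confusion matrices are not reproduced here):

```python
def sens_spec(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative counts only: a conservative model flags few positives,
# trading missed RWMA cases (FN) for a low false-positive count (FP).
sens, spec = sens_spec(tp=18, fn=21, tn=200, fp=25)
```

Under this pattern, sensitivity stays well below specificity, matching the conservative behaviour reported across all three models.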

A notable secondary finding: LLMs demonstrated superior documentation completeness compared to cardiologists — documenting EF in 99% of reports versus 87.5% in cardiologist reports, and diastolic dysfunction in 99–100% versus 91.9%.

Current performance is insufficient for diagnostic replacement of echocardiography. However, the results establish proof-of-concept for ECG-based cardiac triage: in settings where echo is unavailable, an LLM-generated preliminary assessment could identify patients requiring urgent referral (specificity 87%) at a cost of $2 per ECG versus $100+ for echo plus transport.

The clinical workflow we envision: a rural health worker obtains a 12-lead ECG, generates an LLM-based preliminary cardiac assessment, and uses the output to decide whether to refer — reducing blind referrals while catching approximately half of significant wall motion abnormalities that would otherwise go undetected until clinical deterioration.

Key Innovation

First demonstration that large language models can perform cross-modal cardiac inference — generating echocardiographic assessments from 12-lead ECG images alone, without fine-tuning, task-specific training data, or direct echocardiographic input. This suggests ECG waveforms encode exploitable latent structural information accessible to multimodal AI systems.

Immediate next steps focus on improving RWMA detection sensitivity through model fine-tuning on annotated ECG–echo datasets and incorporating multimodal inputs (clinical history + ECG). We are pursuing external validation across multi-centre cohorts and diverse pathologies, with the long-term goal of a prospective clinical trial measuring time-to-appropriate-referral and 90-day patient outcomes.

Technical development includes ensemble methods combining multiple LLMs, uncertainty quantification with confidence scoring, explainability overlays identifying which ECG features drive predictions, and integration with point-of-care ECG devices for real-time deployment.