Over two billion people lack access to echocardiography — the primary tool for assessing cardiac structure and function. In resource-constrained settings across South Asia, Sub-Saharan Africa, and rural communities worldwide, patients with suspected heart failure, valvular disease, or post-infarction complications face diagnostic delays measured in weeks or months, not minutes.
The 12-lead electrocardiogram, by contrast, costs under $2 per test and is available in virtually every clinic. But ECG is conventionally understood as an electrophysiological tool — it measures electrical activity, not structural anatomy. This study asks: can modern AI extract latent structural information from ECG waveforms, effectively inferring what an echocardiogram would show?
We conducted a retrospective comparative analysis of 311 consecutive paired ECG–echocardiogram cases from a single cardiac centre. Three leading large language models — Grok 4, Claude Opus 4, and ChatGPT-4o — were given 12-lead ECG images and asked to generate complete echocardiographic reports through structured prompt engineering, with zero-shot inference and no fine-tuning.
[Figure: study pipeline. 12-lead ECG image → LLM (zero-shot inference) → generated ECHO report → comparison against cardiologist echocardiogram (gold standard).]
LLM-generated reports were compared against cardiologist-authored echocardiograms across ejection fraction (EF), diastolic dysfunction grading, valve regurgitation severity, and regional wall motion abnormalities (RWMA). Statistical analysis employed Pearson correlation, ICC, Bland-Altman analysis, weighted Cohen's Kappa, and Friedman test with post-hoc comparisons.
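The statistical toolkit listed above can be sketched in a few lines of Python. Everything below uses illustrative toy numbers, not study data; the kappa weighting (linear) and the ICC form (ICC(2,1), two-way random effects, absolute agreement) are assumptions, since the report does not specify either.

```python
# Sketch of the study's statistical comparisons on hypothetical data.
# All values are illustrative, not taken from the 311-case cohort.
import numpy as np
from scipy.stats import pearsonr, friedmanchisquare

# Hypothetical paired ejection-fraction values (%) for eight cases:
# echo gold standard and each model's prediction.
ef_echo   = np.array([55.0, 40, 60, 35, 50, 65, 45, 30])
ef_claude = np.array([50.0, 45, 60, 45, 55, 60, 45, 35])
ef_grok   = np.array([60.0, 50, 55, 45, 50, 55, 50, 40])
ef_gpt    = np.array([55.0, 50, 50, 45, 55, 50, 55, 45])

# Pearson correlation between one model and the gold standard.
r, p = pearsonr(ef_echo, ef_claude)

def icc_2_1(x, y):
    """ICC(2,1): two-way random effects, absolute agreement, two raters."""
    data = np.stack([x, y], axis=1)
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((data - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

icc = icc_2_1(ef_echo, ef_claude)

# Bland-Altman: mean bias and 95% limits of agreement.
diff = ef_claude - ef_echo
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))

# Share of EF predictions within ±5 percentage points of gold standard.
within_5 = np.mean(np.abs(ef_claude - ef_echo) <= 5)

def weighted_kappa(a, b, n_cat=4):
    """Cohen's kappa with linear weights on ordinal categories 0..n_cat-1."""
    obs = np.zeros((n_cat, n_cat))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= obs.sum()
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    w = np.abs(np.subtract.outer(np.arange(n_cat), np.arange(n_cat)))
    w = w / (n_cat - 1)
    return 1 - (w * obs).sum() / (w * expected).sum()

# Hypothetical diastolic-dysfunction grades (0-3) per case.
dd_echo   = [0, 1, 2, 1, 0, 3, 2, 1]
dd_claude = [0, 1, 1, 2, 0, 2, 2, 1]
kappa = weighted_kappa(dd_echo, dd_claude)

# Friedman test comparing absolute EF errors across the three models.
abs_err = lambda pred: np.abs(pred - ef_echo)
chi2, p_friedman = friedmanchisquare(
    abs_err(ef_claude), abs_err(ef_grok), abs_err(ef_gpt))
```

The same per-case arrays feed every metric, which mirrors the paired design: each statistic compares a model's report to the cardiologist report for the same patient.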
Claude Opus 4 achieved the highest concordance with cardiologist assessments across all primary metrics — Pearson r = 0.394 (p < 0.001), ICC = 0.368, with 63.6% of EF predictions falling within ±5% of true values. This represents a statistically significant advantage over both Grok 4 and ChatGPT-4o (Friedman χ² = 36.3, p < 0.001).
For diastolic dysfunction grading, Claude Opus 4 achieved fair agreement (weighted κ = 0.324), outperforming both Grok 4 (κ = 0.172) and ChatGPT-4o (κ = 0.135). RWMA detection showed a conservative prediction pattern across all models — high specificity (82.8–91.9%) but limited sensitivity (16.7–46.1%), suggesting the models prioritise avoiding false positives.
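The sensitivity/specificity asymmetry comes straight from the RWMA confusion matrix. A minimal sketch, using illustrative labels rather than study data:

```python
# RWMA detection as a binary classification: 1 = abnormality present.
# Labels below are illustrative, chosen to mimic the conservative
# pattern reported above (few false positives, several misses).
rwma_echo  = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # gold standard
rwma_model = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]  # hypothetical LLM calls

tp = sum(m == 1 and g == 1 for m, g in zip(rwma_model, rwma_echo))
tn = sum(m == 0 and g == 0 for m, g in zip(rwma_model, rwma_echo))
fp = sum(m == 1 and g == 0 for m, g in zip(rwma_model, rwma_echo))
fn = sum(m == 0 and g == 1 for m, g in zip(rwma_model, rwma_echo))

sensitivity = tp / (tp + fn)  # share of true RWMA cases detected
specificity = tn / (tn + fp)  # share of normal cases correctly cleared
```

With these toy labels, sensitivity is 0.5 and specificity about 0.83: the model rarely flags a normal study but misses half the true abnormalities, the same trade-off the results describe.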
A notable secondary finding: LLMs demonstrated superior documentation completeness compared to cardiologists — documenting EF in 99% of reports versus 87.5% in cardiologist reports, and diastolic dysfunction in 99–100% versus 91.9%.
Current performance is insufficient for diagnostic replacement of echocardiography. However, the results establish proof-of-concept for ECG-based cardiac triage: in settings where echo is unavailable, an LLM-generated preliminary assessment could identify patients requiring urgent referral (specificity 87%) at a cost of $2 per ECG versus $100+ for echo plus transport.
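The economics of that triage claim can be made concrete with back-of-envelope arithmetic. The cohort size, disease prevalence, and exact costs below are assumptions for illustration; only the $2/$100 figures and the ~87% specificity come from the text:

```python
# Hypothetical triage cost model: ECG-first screening vs echo for all.
n_patients  = 1000    # assumed screening cohort size
prevalence  = 0.10    # assumed share with significant pathology
specificity = 0.87    # from the text
sensitivity = 0.46    # upper end of reported RWMA sensitivity
cost_ecg, cost_echo = 2, 100  # per-test costs from the text

diseased = n_patients * prevalence
healthy  = n_patients - diseased

# Patients the ECG-based screen would refer on for echo:
# true positives plus false positives.
referred = sensitivity * diseased + (1 - specificity) * healthy

screen_cost   = n_patients * cost_ecg + referred * cost_echo
echo_all_cost = n_patients * cost_echo
```

Under these assumptions the screen-then-refer pathway costs roughly $18,300 for 1,000 patients versus $100,000 for universal echo, at the price of the missed cases implied by the current sensitivity — which is why the text frames this as triage, not diagnosis.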
The clinical workflow we envision: a rural health worker obtains a 12-lead ECG, generates an LLM-based preliminary cardiac assessment, and uses the output to decide whether to refer — reducing blind referrals while catching approximately half of significant wall motion abnormalities that would otherwise go undetected until clinical deterioration.
To our knowledge, this is the first demonstration that large language models can perform cross-modal cardiac inference: generating echocardiographic assessments from ECG waveforms alone, without fine-tuning, task-specific training data, or direct imaging input. This suggests that ECG waveforms encode exploitable latent structural information accessible to multimodal AI systems.
Immediate next steps focus on improving RWMA detection sensitivity through model fine-tuning on annotated ECG–echo datasets and incorporating multimodal inputs (clinical history + ECG). We are pursuing external validation across multi-centre cohorts and diverse pathologies, with the long-term goal of a prospective clinical trial measuring time-to-appropriate-referral and 90-day patient outcomes.
Technical development includes ensemble methods combining multiple LLMs, uncertainty quantification with confidence scoring, explainability overlays identifying which ECG features drive predictions, and integration with point-of-care ECG devices for real-time deployment.