Dataset-JSON: The Future of Clinical Data Exchange

Trinath Panda
Aug 23

If you’ve ever worked with regulatory submissions in clinical research, you’ve probably dealt with SAS XPT files. The format dates from the late 1980s and has been the FDA’s required submission format for decades, and honestly? It has overstayed its welcome.

Limited data types.
No Unicode support.
Hard caps on variable names, labels, and field lengths.
Bloated storage, painful extensibility, and metadata gaps.

That’s where CDISC Dataset-JSON comes in. Launched in 2023 as part of ODM v2.0, it’s a modern, JSON-based format designed to fix those headaches. And it’s not just a “new file type” — it’s a gateway to API-driven, real-time, interoperable clinical data pipelines.

In this post, I’ll break down what Dataset-JSON is, why it matters, who’s behind it, the challenges and debates around it, and where it’s headed. If you’re a programmer, data manager, or clinical data scientist, this will give you the clarity you need to start adopting it.

What is Dataset-JSON?

At its core, Dataset-JSON is a JSON-based transport format for SDTM, ADaM, and SEND datasets. Think of it as:

“Same content, different suitcase” — but a suitcase that finally fits today’s world.

Instead of packing your trial data into a rigid 1980s-style SAS XPT file, you now have a self-describing, extensible, lightweight format.

Key characteristics:

JSON foundation → Aligns with modern data exchange (like HL7 FHIR).
Built-in metadata → Variables, data types, labels live in the file itself.
Extensible → Easy to adapt for unique use cases.
Smaller file sizes → Often 20–40% smaller than XPT or XML.
Modern data types → Strings, decimals, booleans, dates, datetimes, URIs.

How a Dataset-JSON File is Structured

A Dataset-JSON file isn’t just rows of data — it’s organized and layered.

Technical Metadata → Version, timestamps, source system.
Study Metadata → Study identifiers, links to Define-XML if needed.
Dataset Information → Name, label, record counts.
Column Metadata → Variable names, data types, lengths, key sequence.
Rows (Data Records) → The actual observations in array format.

This means you don’t need to crack open Define-XML just to see the basics: variable names, data types, and labels travel with the data. You can open a Dataset-JSON file in any JSON viewer or load it programmatically in Python, R, or SAS.
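To make the layering concrete, here is a minimal sketch that parses a tiny, made-up DM dataset with nothing but Python's standard `json` module. The attribute names (`datasetJSONVersion`, `columns`, `rows`, and so on) follow my reading of the v1.1 specification, and every value is invented for illustration, so check the spec before relying on any of them. `load_records` is a hypothetical helper, not part of any CDISC tool.

```python
import json

# A tiny, made-up DM dataset. Attribute names follow my reading of the
# Dataset-JSON v1.1 spec; the values are illustrative only.
raw = """
{
  "datasetJSONCreationDateTime": "2024-01-15T10:00:00",
  "datasetJSONVersion": "1.1.0",
  "sourceSystem": {"name": "demo-exporter", "version": "0.1"},
  "studyOID": "STUDY.DEMO",
  "metaDataVersionOID": "MDV.DEMO.1",
  "itemGroupOID": "IG.DM",
  "name": "DM",
  "label": "Demographics",
  "records": 2,
  "columns": [
    {"itemOID": "IT.DM.USUBJID", "name": "USUBJID",
     "label": "Unique Subject Identifier", "dataType": "string",
     "length": 8, "keySequence": 1},
    {"itemOID": "IT.DM.AGE", "name": "AGE",
     "label": "Age", "dataType": "integer"},
    {"itemOID": "IT.DM.RFSTDTC", "name": "RFSTDTC",
     "label": "Subject Reference Start Date", "dataType": "date"}
  ],
  "rows": [
    ["DEMO-001", 54, "2023-03-01"],
    ["DEMO-002", 61, "2023-03-04"]
  ]
}
"""


def load_records(text):
    """Return (dataset dict, list of row dicts keyed by column name)."""
    ds = json.loads(text)
    names = [col["name"] for col in ds["columns"]]
    # Rows are plain arrays; the column metadata gives each position a name.
    records = [dict(zip(names, row)) for row in ds["rows"]]
    if len(records) != ds["records"]:
        raise ValueError("row count does not match 'records'")
    return ds, records


dataset, records = load_records(raw)
print(dataset["name"], dataset["label"])
for col in dataset["columns"]:
    print(col["name"], col["dataType"], col["label"])
print(records[0])
```

The point of the sketch is that the column metadata and the data arrive together: the same file tells you that the second position in each row is `AGE` and that it is an integer.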

Why Dataset-JSON Exists

Let’s be real: XPT files worked for decades because there was no better option. But as trials scaled up, globalized, and diversified, the cracks showed:

Variable names limited to 8 characters.
Labels capped at 40 characters.
Character values capped at 200 characters.
No way to represent Unicode (try handling multilingual trials with that).
No support for true datetime types.

Dataset-JSON fixes these by design. It’s not just more flexible — it’s future-ready.
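One way to see how often those limits bite is to scan your own column metadata and values for anything that would not fit in XPT. The sketch below is a rough illustration, not a validator: the 8/40/200 thresholds come straight from the list above, `xpt_violations` is a hypothetical helper, and the variable `AETERM_ORIGINAL` and the sample values are made up.

```python
# XPT limits as listed in this post.
MAX_NAME, MAX_LABEL, MAX_VALUE = 8, 40, 200


def xpt_violations(columns, rows):
    """List the reasons a dataset would not fit the XPT limits above."""
    problems = []
    for i, col in enumerate(columns):
        name = col["name"]
        label = col.get("label", "")
        if len(name) > MAX_NAME:
            problems.append(f"{name}: name exceeds {MAX_NAME} characters")
        if len(label) > MAX_LABEL:
            problems.append(f"{name}: label exceeds {MAX_LABEL} characters")
        # Only text values are subject to the length and encoding checks.
        values = [row[i] for row in rows if isinstance(row[i], str)]
        if any(len(v) > MAX_VALUE for v in values):
            problems.append(f"{name}: value exceeds {MAX_VALUE} characters")
        if any(not v.isascii() for v in values):
            problems.append(f"{name}: non-ASCII text")
    return problems


columns = [
    {"name": "USUBJID", "label": "Unique Subject Identifier"},
    {"name": "AETERM_ORIGINAL", "label": "Reported Term for the Adverse Event"},
]
rows = [
    ["DEMO-001", "Kopfschmerzen"],
    ["DEMO-002", "頭痛"],
]

for problem in xpt_violations(columns, rows):
    print(problem)
```

Here the long variable name and the Japanese term are both flagged. In Dataset-JSON neither is a problem, because names are ordinary JSON strings and the file is Unicode text.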

The People Behind Dataset-JSON

Every standard has champions. For Dataset-JSON, a few names stand out:

Sam Hume (CDISC) → Lead architect, driving technical direction and integration with ODM v2.0.
Jesse Anderson (FDA) → Regulatory champion, bridging FDA needs with technical pilots.
Stuart Malcolm (Veramed) → CRO perspective, PHUSE co-lead for the pilot.
Lex Jansen (CDISC) → Technical expert, especially around SAS and metadata integration.

Their work, plus collaborations with FDA, PHUSE, and CDISC hackathons, is why we’re even having this conversation today.

Current Debates Around Dataset-JSON

Like any new standard, Dataset-JSON isn’t without controversy.

Tool readiness → Most working tools are open source (GitHub, hackathons). Critics argue commercial, validated tools lag behind.
Performance → While files are smaller, some conversions struggle with very large datasets. NDJSON (newline-delimited JSON) is emerging as a fix.
Migration costs → Moving from XPT to JSON means retooling pipelines. That’s time + money.
Regulatory adoption → FDA is piloting it, but global acceptance will take time.
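On the performance point, the appeal of NDJSON is that a reader can process one row at a time instead of parsing the whole file into memory. Here is a minimal sketch of that idea, assuming the layout I understand the NDJSON variant to use: the metadata object on the first line, then one row array per line. `iter_ndjson_records` is a hypothetical helper, and the sample data is invented.

```python
import io
import json


def iter_ndjson_records(lines):
    """Yield one dict per data row without loading the whole file."""
    lines = iter(lines)
    header = json.loads(next(lines))  # assumed: first line holds the metadata
    names = [col["name"] for col in header["columns"]]
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines
            yield dict(zip(names, json.loads(line)))


# io.StringIO stands in for an open file handle here.
sample = io.StringIO(
    '{"name": "VS", "columns": [{"name": "USUBJID"}, '
    '{"name": "VSTESTCD"}, {"name": "VSSTRESN"}]}\n'
    '["DEMO-001", "SYSBP", 128]\n'
    '["DEMO-001", "DIABP", 82]\n'
)

high = [r for r in iter_ndjson_records(sample) if r["VSSTRESN"] > 100]
print(high)
```

Because the function is a generator, memory use stays roughly constant no matter how many rows the file holds. With a real file you would pass the handle from `open(path, encoding="utf-8")` instead of the `StringIO` object.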

Key Research and Pilots

FDA/PHUSE/CDISC Pilot (2023–2024)

Tested Dataset-JSON as a submission format.
Results: Generally successful, with minor issues around date precision and formatting.
Takeaway: Technically sound, but business processes still need alignment.

File Size Analysis

Dataset-JSON consistently beats XPT and XML in file size.
Example: A large ADaM dataset went from 33,441 KB (XPT) to 24,942 KB (Dataset-JSON), a reduction of about 25%.

Hackathon Innovation

First hackathon produced 20+ tools (converters, viewers, Python/R/SAS utilities).
Proved feasibility, seeded open-source ecosystem.

Where Dataset-JSON is Headed

This isn’t just a file format upgrade. It’s the start of a bigger shift.

Define-JSON → Next evolution of Define-XML, for metadata in JSON.
API-first exchange → Supports real-time data sharing instead of batch file transfers.
Extended data capabilities → Names longer than 8 characters, longer labels, bigger fields.
NDJSON + compression → Optimized for massive datasets.
Global adoption → FDA pilot is the first step; EMA, PMDA, and others are expected to follow.

In other words: Dataset-JSON is paving the way for live, API-driven clinical data pipelines, not just static file drops.
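To give a feel for the "NDJSON + compression" item, here is a small sketch that writes line-delimited rows through gzip and reads them back. Treat the details as my assumptions: gzip is simply a convenient standard-library choice rather than something this post or the pilot prescribes, the `.ndjson.gz` file name is made up, and `write_ndjson_gz` and `read_ndjson_gz` are hypothetical helpers.

```python
import gzip
import json
import os
import tempfile


def write_ndjson_gz(path, metadata, rows):
    """Write the metadata line, then one row array per line, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(json.dumps(metadata) + "\n")
        for row in rows:
            fh.write(json.dumps(row) + "\n")


def read_ndjson_gz(path):
    """Read the file back as (metadata, rows)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        metadata = json.loads(fh.readline())
        rows = [json.loads(line) for line in fh if line.strip()]
    return metadata, rows


# Invented lab data with plenty of repetition, as real datasets tend to have.
metadata = {
    "name": "LB",
    "columns": [{"name": "USUBJID"}, {"name": "LBTESTCD"}, {"name": "LBSTRESN"}],
}
rows = [[f"DEMO-{i:04d}", "GLUC", 5.0 + (i % 10) / 10] for i in range(5000)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lb.ndjson.gz")
    write_ndjson_gz(path, metadata, rows)
    compressed = os.path.getsize(path)
    meta_back, rows_back = read_ndjson_gz(path)

# Size the same content would take uncompressed (ASCII, so chars == bytes).
plain = len(json.dumps(metadata)) + 1 + sum(len(json.dumps(r)) + 1 for r in rows)
print(f"plain: {plain} bytes, gzip: {compressed} bytes")
```

Clinical datasets repeat the same identifiers and codes row after row, which is exactly the kind of content general-purpose compression handles well. How much you save on real data will vary.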

Practical Learning Path

If you’re thinking about adopting Dataset-JSON, here’s a roadmap:

Learn JSON basics → Quick tutorials (W3Resource, “Learn JSON in 10 Minutes” on YouTube).
Read the CDISC spec → Dataset-JSON v1.1 Specification.

Closing Thoughts

Dataset-JSON is not just a technical upgrade — it’s a paradigm shift for clinical research data.

Easier to exchange.
Smaller and faster.
Future-proofed with APIs and real-time integration.

Yes, there are challenges — tool maturity, regulatory timelines, migration costs. But the trajectory is clear: Dataset-JSON will replace XPT. It’s not a matter of if — it’s when.

If you’re in clinical data programming, now is the time to get hands-on, experiment, and prepare your pipelines.

Resources:

CDISC Dataset-JSON GitHub
FDA Pilot Report