
DSJSON: A Python Package to Make Dataset-JSON Simple

  • Writer: Trinath Panda
  • 12 hours ago
  • 3 min read

Making Dataset-JSON files usually means a lot of manual work—lining up data, adding metadata by hand, and checking if it all matches. I wanted to make that easier. So, I built a Python package that takes your SDTM/ADaM data and metadata (CSV, Excel, or JSON) and turns it straight into a Dataset-JSON v1.1 file.



Under the Hood: The Implementation

Core Components

The package is built around two main functions:

  • load_metadata: For loading column metadata from various sources.

  • to_dataset_json: For creating the final Dataset-JSON structure.


Key Design Decisions

  • Automatic generation of datasetJSONCreationDateTime.

  • Enforcement of required top-level fields like name, label, and itemGroupOID.

  • Validation to ensure that the columns in the metadata and data match.
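
To make these three decisions concrete, here is a minimal sketch in plain Python. It is not the package's source: the function names check_columns_match and build_header are hypothetical, and the timestamp format is an assumption (ISO 8601 without a UTC offset).

```python
from datetime import datetime, timezone


def check_columns_match(data_columns, metadata_columns):
    """Raise ValueError if the data and the metadata disagree on column names."""
    not_in_metadata = [c for c in data_columns if c not in metadata_columns]
    not_in_data = [c for c in metadata_columns if c not in data_columns]
    if not_in_metadata or not_in_data:
        raise ValueError(
            f"Column mismatch: not in metadata={not_in_metadata}, "
            f"not in data={not_in_data}"
        )


def build_header(name, label, item_group_oid):
    """Return the top-level fields, refusing empty required values."""
    required = {"name": name, "label": label, "itemGroupOID": item_group_oid}
    empty = [key for key, value in required.items() if not value]
    if empty:
        raise ValueError(f"Missing required top-level fields: {empty}")
    # The creation timestamp is generated here, not supplied by the caller.
    # Assumed format: ISO 8601, second precision, taken from the UTC clock.
    created = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return {"datasetJSONCreationDateTime": created, **required}
```

With a pandas DataFrame, the first argument to check_columns_match would simply be list(df.columns). Failing early with a message that names the offending columns is the point: a mismatch caught here is much cheaper than one discovered after the file has been shared.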


The Development Journey

I’ll be honest: building a package always looks simple until you start.

Step 1: Framing the Problem

I broke down the Dataset-JSON spec into minimum viable building blocks:

  • Rows: the actual data.

  • Columns metadata: name, label, data type, and mapping.

  • Top-level metadata: datasetJSONVersion, originator, datasetJSONCreationDateTime, etc.

If I could get those three aligned, the rest would fall into place.
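
As a rough illustration of how the three blocks fit together, the sketch below assembles them by hand into one dictionary and serializes it. Every value is made up for the example (the OIDs, the originator, the fixed timestamp), and the column attribute names follow my reading of the Dataset-JSON v1.1 column definition, so check them against the spec before relying on them.

```python
import json

# Columns metadata: one entry per variable, in the same order as the row values.
columns = [
    {"itemOID": "IT.VS.USUBJID", "name": "USUBJID",
     "label": "Unique Subject Identifier", "dataType": "string"},
    {"itemOID": "IT.VS.VSSTRESN", "name": "VSSTRESN",
     "label": "Numeric Result/Finding in Standard Units", "dataType": "float"},
]

# Rows: one list per record, values positioned to match `columns`.
rows = [["STUDY1-001", 120.0], ["STUDY1-002", 118.0]]

# Top-level metadata wraps the other two blocks.
dataset = {
    "datasetJSONCreationDateTime": "2025-08-19T00:00:00",  # placeholder value
    "datasetJSONVersion": "1.1.0",
    "originator": "Example Sponsor",  # placeholder value
    "itemGroupOID": "IG.VS",
    "name": "VS",
    "label": "Vital Signs",
    "records": len(rows),
    "columns": columns,
    "rows": rows,
}

# The alignment that matters: every row carries exactly one value per column.
assert all(len(row) == len(columns) for row in rows)

text = json.dumps(dataset, indent=2)
```

Rows carry no column names of their own, which is why the alignment between rows and columns metadata has to be checked explicitly.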


Step 2: Implementation Choices

  • Base on pandas: Every clinical programmer moving into Python touches pandas. That had to be my foundation.

  • Metadata flexibility: Some teams maintain metadata in CSVs, some in Excel, some in JSON. So, I built loaders for all three.

  • Minimal dependencies: Keep it lean so it doesn’t break when someone installs it in a restricted clinical IT environment.
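
The metadata-flexibility idea can be sketched as a small dispatcher on file type. This is an assumption about the general shape, not the actual load_metadata code: read_column_metadata is a hypothetical name, and the sketch covers only CSV and JSON using the standard library. Excel is left out here because it needs a third-party reader such as pandas.read_excel.

```python
import csv
import json
from pathlib import Path


def read_column_metadata(path, file_type):
    """Hypothetical loader: return a list of dicts, one per column."""
    path = Path(path)
    if file_type == "csv":
        # Each CSV row becomes a dict keyed by the header line.
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    if file_type == "json":
        # The JSON file is assumed to hold a list of column objects already.
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    raise ValueError(f"Unsupported file_type: {file_type!r}")
```

Whatever the source format, the caller gets back the same list-of-dicts shape, so the rest of the pipeline never needs to know where the metadata came from.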


Step 3: The First Working Version

The first time I ran:

import pandas as pd
from dsjson import load_metadata, to_dataset_json
# Load the dataset records
rows = pd.read_csv("examples/vs.csv")
# Load the column metadata that describes those records
columns = load_metadata("examples/columns_vs.csv", file_type="csv")
# Combine both into a Dataset-JSON structure
ds = to_dataset_json(rows, columns, dataset_name="VS", dataset_label="Vital Signs")

…and it produced a valid Dataset-JSON file. That was one of those small developer victories that feels huge.


From Local Script to Python Package

Here’s where things got interesting. Writing code is one thing; turning it into a package others can install with pip is a different game.

I had to:

  • Structure the repo properly (core, tests, examples).

  • Write documentation that wouldn’t make people quit after the first read.

  • Create a CHANGELOG.md (because future me will forget why I changed things).

  • Push to PyPI with a clean versioning system.

On August 19, 2025, I finally tagged and released v1.0 to PyPI. That moment when pip install dsjson actually worked. Priceless.


What the Package Does Today

At v1.0, DSJSON is focused and pragmatic:

  • Input: Any pandas-friendly dataset (CSV, Excel, JSON) + column metadata from CSV, Excel, or JSON.

  • Output: A conformant Dataset-JSON v1.1 file with all required metadata (datasetJSONVersion, datasetJSONCreationDateTime, name, label, itemGroupOID, columns, rows, records, originator, sourceSystem_name, etc.).

  • Utility functions: Load metadata, validate structure, and generate Dataset-JSON in one shot.

It doesn’t try to be everything. It just does one job well: make Dataset-JSON generation simple and reproducible.


Lessons Learned Along the Way

  • Simplicity wins: Don’t try to build the “perfect” package on day one. Ship something useful, then improve.

  • Documentation matters: If you don’t explain it well, even good code looks unusable.

  • Versioning discipline: Writing a changelog is boring… until it saves you from asking “what the hell did I change last month?”

  • Releasing is a skill: Getting it onto PyPI took as much learning as writing the code itself.


What’s Next

The roadmap is clear:

  • Add support for XML metadata input.

  • Build validation against the official Dataset-JSON schema.

  • Explore integration with FHIR resources for real-world data pipelines.


Closing Thoughts

Version 1.0 is just the beginning. The package will grow with feedback, new ideas, and real-world use. It’s out there now: easy to try, easy to use, and open for contributions.


The code is on GitHub: DSJSON-PY.

Check out the PyPI package: dsjson


If you have ideas or find issues, open an Issue or drop me a message. Let’s keep making clinical data tools better together.


© 2025 By Trinath Panda
