Support synthetic data import #63

Open
minrk opened this issue Apr 12, 2025 · 3 comments

Comments

@minrk
Contributor

minrk commented Apr 12, 2025

For both testing and demo purposes, it would be extremely useful to support synthetic data import, e.g. from Synthea. With that data, we could produce tests and example notebooks that we can actually display publicly, without needing to populate 'real' data, which we then can't use as examples in documentation.

We'd have to do our own part to handle e.g. consents of fake accounts, etc. but I think that should not be hard.

@s1monj
Collaborator

s1monj commented Apr 12, 2025

@minrk sounds good but I don't see any examples of generating device/wearable Observations over a time frame? I came across this from 2018 but that's per day - were you thinking of creating a custom module?

@minrk
Contributor Author

minrk commented Apr 13, 2025

I’m not sure what the best way is. If someone has already published appropriate sample data online, slotting that in would work.

For BP, I was considering just generating data with Synthea and inserting the values into what we have to create records. CGM has much more characteristic curves, so that approach wouldn’t work for it. Even sampling a hand-drawn curve would be okay.
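
As a rough illustration of the "sample a hand-drawn curve" idea, here is a minimal sketch that linearly interpolates between a few control points and adds a little noise. The control points, the 5-minute interval, the noise level, and the function names are all made up for illustration; none of them come from this project or from a real dataset.

```python
import random

# Hypothetical hand-drawn daily glucose curve: (hour of day, mg/dL) control points.
# These values are invented for illustration, not taken from any real subject.
CONTROL_POINTS = [
    (0, 95), (7, 90), (8, 150), (10, 105), (12.5, 100), (13.5, 160),
    (16, 105), (19, 110), (20, 155), (23, 100), (24, 95),
]


def sample_curve(points, hour):
    """Linearly interpolate the curve at a given hour of the day (0 to 24)."""
    for (h0, v0), (h1, v1) in zip(points, points[1:]):
        if h0 <= hour <= h1:
            frac = (hour - h0) / (h1 - h0)
            return v0 + frac * (v1 - v0)
    raise ValueError(f"hour {hour} is outside the curve")


def synthetic_cgm_day(interval_minutes=5, noise=3.0, seed=0):
    """Return (minute_of_day, mg/dL) samples for one synthetic day."""
    rng = random.Random(seed)  # seeded so tests and demos are reproducible
    readings = []
    for minute in range(0, 24 * 60, interval_minutes):
        value = sample_curve(CONTROL_POINTS, minute / 60) + rng.gauss(0, noise)
        readings.append((minute, round(value)))
    return readings
```

Calling `synthetic_cgm_day()` gives one day of samples; mapping them onto our record format would still be a separate step.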

@minrk
Contributor Author

minrk commented Apr 15, 2025

@maryamv brought up iglu, which includes sample data that appears to originate from https://doi.org/10.1371/journal.pbio.2005143.s010 in this paper (105k records, 57 subjects, ~2k records per subject; most span about a week, some much longer but with a similar sample count). We could use that to populate synthetic CGM data.

I think for BP, a random walk within range really ought to be enough, or we could extract numbers from Synthea, like I did here.

The main thing is:

  1. generating all the right fields for our schema
  2. loading it into JHE so we can run a test or demo 'for real' against JHE, but with non-sensitive output
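
For the BP option mentioned above, a random walk within range could look something like the sketch below. The ranges, step sizes, one-reading-per-day spacing, and output field names are assumptions chosen for illustration; they are not our schema, and producing the right fields for JHE would still need to be layered on top.

```python
import random
from datetime import datetime, timedelta


def bounded_random_walk(n, start, low, high, step, rng):
    """Return n integer values of a random walk clamped to [low, high]."""
    values = []
    current = start
    for _ in range(n):
        current += rng.uniform(-step, step)
        # Clamp so the walk cannot leave the plausible range.
        current = min(max(current, low), high)
        values.append(round(current))
    return values


def synthetic_bp_readings(n=14, start_time=None, seed=0):
    """Generate n daily readings; field names here are hypothetical."""
    rng = random.Random(seed)  # seeded so output is reproducible
    start_time = start_time or datetime(2025, 1, 1, 8, 0)
    # Illustrative ranges only: systolic 90-140 mmHg, diastolic 60-85 mmHg.
    systolic = bounded_random_walk(n, 120, 90, 140, 5, rng)
    diastolic = bounded_random_walk(n, 80, 60, 85, 3, rng)
    return [
        {
            "time": (start_time + timedelta(days=i)).isoformat(),
            "systolic_mmhg": s,
            "diastolic_mmhg": d,
        }
        for i, (s, d) in enumerate(zip(systolic, diastolic))
    ]
```

Because the generator is seeded, the same call always yields the same readings, which keeps tests and demo notebooks stable.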
