Data Security, Ownership, and Evolution to Common Data Dictionaries
Video Transcription
All right, thank you. So quickly, I just want to note a couple of disclosures. I'm a current advisor to Ciba Healthcare, and I'm a former employee of Philips Healthcare within the last two years. The objectives I'm gonna cover today: hopefully at the end of this, attendees will be able to describe some of the general requirements around obtaining security certifications such as ISO 27001, explain the role of trusted research environments in collaborative research, and, with some terms that may be new to you, differentiate between federated learning, differential privacy, and homomorphic encryption as privacy-preserving techniques for data analysis. And then lastly, we'll talk about the purpose of common data models and some of the frequent challenges encountered when implementing them. So first, as we get into data security: if you're collaborating with a different organization, a company, and you have a business associate agreement, how do you know that they're treating the data securely? That's a very tough thing for all of us to keep up with, and because of that, there are different bodies that will do certifications. Some of these are ISO 27001 and SOC 2; I'll talk about those in the next slide. If you get involved with some government work, you have to be familiar with HITRUST and FedRAMP, and then I'm gonna talk about some of the new technologies I alluded to before. So ISO is the International Organization for Standardization, and the 27001 segment refers to standards for information security. SOC is the Service Organization Control, and SOC 2 is a framework that defines how companies manage, process, and store customer data. These are not healthcare-specific bodies or certifications; they are for all types of companies. So if you wanna get certified to show that you are treating your data with the right security principles and following this framework, you'll go through this process, and I'll just give an example of how ISO works. There's a framework that combines your policies and processes for your organization, and that comes down to adopting what they call an information security management system. Every company that goes through this will have their own information security management system, and you can imagine, depending on what industry you're in, that can look very different. The idea is that you will set up these rules for how you're going to identify stakeholders and their expectations for information security. You will identify risks to that information and define mitigation methods, kind of like we just heard about; they refer to those as controls. So what will you do when some of these known and expected risks come up? Have clear objectives for information security for your entire business. Implement those controls, and then show evidence that you're adhering to them and continuously monitoring their performance. And then on top of that, have a continuous improvement process, so that as you look at how well you're doing, you keep enhancing and refining your process. All that to say, there's not one specific checklist that says, this is how we know they're following all the right security practices. There's a big process that gets defined and analyzed, and then you get audited to show that you're in compliance with it, and that's how you get your certificate.
So moving past that, I'm gonna talk a little bit more about what, for me at least, are some of the more interesting things: data ownership and some of the techniques for collaborating. It shouldn't be any surprise to anybody here that data ownership is a big challenge, and a lot of health systems are going to be very resistant to sharing data openly. Some of that is just because of privacy concerns or litigation or things that could go wrong. Some of it, unfortunately, is also monetary; there may be a lot of financial opportunity in some of the data, and sharing it more freely can undermine some of those things. So there are a lot of different reasons why this can be a challenge. And this doesn't even get into GDPR, where data use is much more restricted. Because of that, we've seen a real explosion of different techniques being developed that decrease the burden of data sharing, and I'm gonna go into some of those. The first one is a trusted research environment. If you think about how things worked in the past with a secure data network, I know there are different research networks around the world; ICNARC in the UK had their data set, and if you wanted to work with that in the past, you would have to physically go into their system, be credentialed, there'd be an isolated area where you could work, you couldn't take any data out, and you'd have to be in kind of a secure location. That's kind of similar to hospital settings, right? You can't just walk into a hospital and suddenly be able to get into all this private, secure data. What a trusted research environment does is basically allow you to simulate being in this very locked-down, secure environment, but you can do it from anywhere. And this may sound like, well, isn't that just like VPNing in, or going into our system like I do from home? No, it's actually different, because it adds other security controls and restrictions: you can't even copy and paste text, you won't be able to save data, and some of them have restrictions even around screenshots. So it becomes a very secure environment that, once you're credentialed, you can access, but it's almost like you're still working in a locked room where you can't take anything out. There is a nice published example of this, if you want to see how one of these was implemented. It comes from the paper referenced here, which was in BMJ: PIONEER, the HDR UK data hub. If you look at the flow of how they did this, you see they had their own internal hospital database, and they extracted some of that data; they also had some external data that they would extract, and then they would bring that data in and secure it on a server. Once they had that data prepped at sort of a first level, they would provide a description of it out to the public. So they'd have some description of what this data is, and that would allow other researchers to see, okay, what does this data contain? They're not actually getting at the data, but they get to see summaries, descriptions, data dictionaries, et cetera. And so they can go through this catalog and say, well, this looks like a good data set that I might want to do research on.
And so they can apply for access to that data and go through a data trust committee, which would do a review and make sure the people requesting access understand how to use the data, have gone through human subjects research training, et cetera. In parallel, the data is transferred onto a cloud and made research ready, and then at the final stage, it's put into one of these trusted research environments. If someone's approved, they can go access it there. All this to say, again, just like we talked about with tele-critical care, none of these technologies works in isolation. You don't just say, I have a trusted research environment, now I can let anyone have access. You still need a whole program around it to make sure your data's secure. So this is one of the more interesting things I've learned about over the last several years: federated learning. It's actually been around a while, but the AI version of this was really introduced in 2017 by Google. They published this work, and you might ask, why was Google motivated to do federated learning? The idea was, they have their Android phones, and within Android phones they have an app called Gboard, where people are texting and doing all their typing on their phones. They would create a model to help with predictive typing in Gboard. They would deploy these models out to the phones, and then people would use them. But obviously, they want to enhance those models and make them more accurate. If you think about how we traditionally do research, historically you would just aggregate everybody's data onto a server. Can you imagine taking everything that you've typed on your phone and putting it on Google servers so they could analyze it in aggregate and be better at predicting typing? That's a pretty scary privacy issue. Nobody would really be okay with that. But then there are also just technical constraints. Even if they had wanted to do this and not tell anybody, the bandwidth to move everything someone was typing, the amount of storage, et cetera, is just too much to handle. So what they did is they came up with this approach where they can deploy the model, the model runs locally on your phone and helps do predictive typing for you, and it learns from your own typing, basically creating another iteration of that model based on your own typing. And then what it sends back to Google is what it learned from your data. You're basically sending back the equivalent of what we think of as coefficients. So it never actually has to send anything you typed; it sends back what it learned. Google then pulls that together with what was learned from typing on other phones, updates the model, pushes it back out, and keeps enhancing it. So this is really quite innovative, and this field has really exploded over the last several years because it provides a lot of opportunity for privacy-preserving learning.
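To make the federated learning idea concrete, here is a minimal, purely illustrative Python sketch of federated averaging, not Google's actual Gboard system: simulated clients each train on their own local data and share back only model coefficients, which a central server combines into an updated global model. The data, client sizes, and learning parameters are all made up.

```python
# Minimal federated-averaging sketch (illustrative only).
# Each "client" fits a local update to a shared linear model and sends back only
# its coefficients and sample count -- never its raw data. The server combines
# the updates with a weighted average and redistributes the new global model.
import numpy as np

def local_update(global_w, X, y, lr=0.1, epochs=5):
    """One client's training pass: start from the global weights, take a few
    gradient steps on local data, and return only the updated weights."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # squared-error gradient
        w -= lr * grad
    return w, len(y)

def federated_round(global_w, clients):
    """Server side: weighted average of the returned client weights."""
    updates = [local_update(global_w, X, y) for X, y in clients]
    total = sum(n for _, n in updates)
    return sum(w * (n / total) for w, n in updates)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
# Simulate three clients, each holding private data that never leaves "its device".
clients = []
for n in (50, 80, 30):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
print(w)  # approaches [2, -1] without any client sharing raw data
```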
Next, I'll talk about differential privacy. The idea here, and this isn't always something used on its own, is that you can help create some privacy by introducing noise into your data. We know that just because you take out all the PHI elements, that doesn't mean somebody can't be identified. So what you can do to enhance that privacy is modify the data a little bit, so it's not exactly what it was in its raw form, but you keep it close enough that the inference from your analysis is still valid. The graph here is meant to indicate, if you look at the lower right, that if you don't make any changes and don't introduce any noise, you'll be completely accurate, but the privacy protection you've added will be negligible. As you introduce noise, the privacy goes up. Ultimately, you can change the data so much that you've protected everybody, but you no longer have anything accurate. So it's about finding this middle ground, and it helps prevent malicious actors from being able to re-engineer the data and figure out what was actually there, even though you've hidden it. The last technique I wanna talk about is homomorphic encryption, which I think is a pretty fascinating new technology. The idea here is, what if you don't even trust sending any bit of your data to be analyzed by somebody else, whether it's de-identified or not? What homomorphic encryption allows is that you can actually run analyses on encrypted data. So what does that mean? You're used to encryption where you encrypt a file, it now has a key with it, and all the data in that file is unusable, just a bunch of gibberish. Nobody can do anything with it until you unlock it with the key, get back to what it really says, and then work with it. Well, homomorphic encryption says, I can take this data; it started as five plus 10. Let's say this patient had five ED visits last year and 10 ED visits this year, and you want some analysis done across all these patients on their ED visits. You don't want to expose how many visits these patients had, even if they're only labeled with a random identifier. So it gets encrypted, and you end up with something totally unrelated to numbers; X plus YZ is the ED visits. Homomorphic encryption lets you operate on that encrypted data and still calculate what the result is. So you can actually run analyses on data that's uninterpretable to everybody else, and then send back the results. This is really quite powerful, because there's almost no risk if the data gets leaked or gets out; nobody can do anything with it while it's encrypted. Now, the limitations are that this is very computationally intensive. You're no longer just doing five plus 10; you're working through an extremely long string of operations and trying to figure out the result mathematically. The cryptographic techniques involved are very complex, which I couldn't even begin to explain, but I've heard that on the order of 10 times the compute power is needed, on average, to run some analyses. And on top of that, there's a limited number of analyses they've figured out how to do this way. You can do a lot of basic statistics and some regression models, but you're not likely to be able to do a deep learning model or a random-effects hierarchical model or things like that. They haven't really figured out how to do those yet, or don't have the compute power for it.
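To make these last two techniques a bit more concrete, here are two small, purely illustrative Python sketches. The first shows the noise-versus-accuracy tradeoff in differential privacy using the Laplace mechanism: noise scaled by the query's sensitivity and a privacy parameter epsilon is added to a count before it is released. The counts and epsilon values are made up.

```python
# Differential-privacy sketch: a noisy count query using the Laplace mechanism.
# Smaller epsilon -> more noise -> more privacy, less accuracy (the tradeoff
# described above). All numbers here are illustrative.
import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return the true count plus Laplace noise with scale sensitivity/epsilon."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(42)
true_count = 128  # e.g., number of patients meeting some criterion
for eps in (10.0, 1.0, 0.1):
    noisy = dp_count(true_count, eps, rng=rng)
    print(f"epsilon={eps:>4}: reported count = {noisy:.1f}")
# Larger epsilon stays close to 128; epsilon = 0.1 can be off by tens.
```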
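The second sketch illustrates computing on encrypted values using the Paillier cryptosystem via the third-party python-paillier (phe) package; this is one possible choice for illustration, not a tool named in the talk. Paillier is only additively homomorphic, which matches the caveat above that just a limited set of analyses can be done this way.

```python
# Homomorphic-encryption sketch using the Paillier cryptosystem via the
# third-party "phe" (python-paillier) package. Paillier is additively
# homomorphic: ciphertexts can be added (and multiplied by plaintext
# constants) without ever being decrypted.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# The data owner encrypts the values (e.g., 5 ED visits last year, 10 this year).
enc_last_year = public_key.encrypt(5)
enc_this_year = public_key.encrypt(10)

# An untrusted analyst, holding only the public key and ciphertexts,
# computes on the encrypted data without learning the underlying numbers.
enc_total = enc_last_year + enc_this_year
enc_scaled = enc_total * 2          # multiply by a plaintext constant

# Only the data owner, holding the private key, can read the results.
print(private_key.decrypt(enc_total))   # 15
print(private_key.decrypt(enc_scaled))  # 30
```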
All right, so now we'll shift into a bit about common data models. The idea around common data models is that when you wanna start sharing data and using it across health systems or partners, you really need data that is in what we think of as the same structure and that has the same meaning. Some of the early examples of this are i2b2 and OMOP; OMOP has become quite popular. It's been used for quite a while, and I'm sure many of you are familiar with it. PCORnet came out of the National Patient-Centered Clinical Research Network, so all the hospitals participating in that have developed a PCORnet database using its data model. There's the Sentinel Initiative, which has its own data model, and there's CDISC, which you use if you're going to submit for regulatory purposes to the FDA. So there are all these different examples, and most of them are very rigid with the structure: this is exactly how you have to format your data, and these are the definitions for the data that comes in. Then you're able to either share the data, or you can take your models and share them across your different sites so they can run the analyses locally. The problems with these are that there's a lot of variability in the scope and frequency of data updates. When I talk about scope, what we've found is that you have hospitals that both say, oh yeah, we use OMOP. But then you look at it, and it turns out this hospital used OMOP for one study on drug-eluting stents in 5,000 patients, and that's their OMOP database, while this other hospital created a more general OMOP database, so the two aren't really comparable. Some of them update daily or monthly; some haven't been updated in years. So you end up with all kinds of variation. But even more problematic is that structurally everything looks the same, while the meaning may be quite different. You can imagine different lab assays being used at different hospitals: everything's labeled the same, but the way you would interpret the value is different; it's just all been normalized to where you think it's the same. Or take blood gases: you're supposed to have arterial blood gases all put into one field, but one place has mixed their venous blood gases in with their arterial blood gases. Now it all looks the same, and as a researcher, you're assuming it's the same. It turns out they didn't really do it the same way, so now you've got a lot of problems with your data. Then lastly, I think this is a big problem as well as we talk about diversity, equity, and inclusion and really being able to get good representation across data sources. Typically it's the large academic, less resource-constrained facilities that have gotten involved in creating these common data models. Your smaller hospitals and smaller health systems just haven't invested in this, and they tend not to have the incentive to. So they get left out of all this research: the research happens at the big academic centers, and the places where so much of the practice is happening are just left out of it. And this perpetuates a lot of problems. So because of some of this, one of the really interesting things that has come out is the concept of using FHIR as a meta common data model. If you think about what FHIR is, these resources: you can look at your typical EMR, and all the different data elements brought into it, at a very granular level, could be stored in something like 145 different data structures or tables. FHIR is actually very well organized. It's not a free-for-all, but it's very flexible. It doesn't come in the rigid table format we're used to; instead, everything is tagged and annotated. So you can see, if data comes in with different terminology, say a diagnosis, whether it came from SNOMED or from ICD, et cetera; all of that is there. And tools have been developed that will take OMOP and convert it into FHIR, and take PCORnet and convert it into FHIR. That means if you have two hospitals or health systems that have built up different common data models, you can now align them into one and use that going forward, because the incentive was already hard enough for building out the first data model; when you start asking them to build out a second or third, you get a lot of resistance. So this is kind of an interesting approach that will help with that. And your smaller hospitals, which were never gonna do a common data model anyway, already have to be able to share data in FHIR, so they could get involved by just sharing it in its native state. And if you ever convert to just using FHIR, which is much more granular and has a lot more flexibility, you don't have to worry about migrating all the old data, because you've already put it into FHIR, so you can keep that historical data.
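As a small, purely illustrative sketch of what "tagged and annotated" means in practice, the following Python snippet maps a hypothetical local lab row into a FHIR R4 Observation-style JSON structure, carrying the terminology system and the units along with the value so that sites using different source vocabularies can be aligned. The patient ID, local code, mapping table, and unit conversion are all invented for illustration.

```python
# Illustrative sketch: representing a local lab result as a FHIR R4
# Observation-style JSON structure. The terminology system and units travel
# with the value, so two sites using different source vocabularies can still
# be aligned on the same representation.
import json

# Hypothetical row from a local hospital system (site-specific code and units).
local_row = {"local_code": "LAB_LACT", "value": 18.0, "units": "mg/dL",
             "patient_id": "12345", "drawn_at": "2023-01-15T08:30:00Z"}

# Hypothetical site-specific mapping: local code -> standard terminology + unit conversion.
code_map = {
    "LAB_LACT": {"system": "http://loinc.org", "code": "2524-7",
                 "display": "Lactate [Moles/volume] in Serum or Plasma",
                 "to_mmol_per_L": lambda mg_dl: round(mg_dl / 9.01, 2)},
}

def to_fhir_observation(row):
    m = code_map[row["local_code"]]
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": m["system"], "code": m["code"],
                             "display": m["display"]}]},
        "subject": {"reference": f"Patient/{row['patient_id']}"},
        "effectiveDateTime": row["drawn_at"],
        "valueQuantity": {"value": m["to_mmol_per_L"](row["value"]),
                          "unit": "mmol/L",
                          "system": "http://unitsofmeasure.org", "code": "mmol/L"},
    }

print(json.dumps(to_fhir_observation(local_row), indent=2))
```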
So in summary, reviewing what we covered: ensure that your business associates are certified and adhere to cybersecurity best practices. Sharing data is possible, but aggregating it into a single location is not always feasible and not always desirable. There are a lot of technologies that have advanced over recent years that support collaborative research without aggregating data, some of those being trusted research environments, federated learning, differential privacy, and homomorphic encryption. And a lot of these you can layer on top of each other; they don't all have to be independent. When you're sharing data across partners, you do need a common data model, but you really have to think about both structure and meaning. The meaning can often be a trap, where things look the same but they're really not, so you have to have tight integration with those partners. Thank you.
Video Summary
In this video, the speaker discusses data security, collaborative research, and common data models. They first talk about the importance of obtaining security certifications such as ISO 27001 and SOC 2 when collaborating with other organizations to ensure data security, and also mention certifications like HITRUST and FedRAMP for government work. The speaker then introduces trusted research environments as a way to collaborate securely on research without needing to be physically on-site with the data. They also discuss federated learning, differential privacy, and homomorphic encryption as privacy-preserving techniques that allow analysis of distributed, modified, or encrypted data without compromising privacy. Lastly, the speaker highlights the challenges of implementing common data models and suggests using FHIR as a meta common data model to align different models and ensure data compatibility across organizations.
Asset Subtitle: Crisis Management, Administration, 2023
Asset Caption: Type: one-hour concurrent | Challenges of the New Frontier: Tele-Critical Care (SessionID 1185615)
Content Type: Presentation
Knowledge Area: Crisis Management, Administration
Membership Level: Professional, Select
Tag: Emergency Preparedness
Year: 2023
Keywords: data security, collaborative research, common data models, privacy-preserving techniques, trusted research environments, FHIR