Synthetic datasets can provide the health service with better AI models

    Producing synthetic patient data reduces the risk of privacy violations when AI tools are introduced. However, synthetic data can involve other types of risks.

    The EU's new Artificial Intelligence Act (1) defines artificial intelligence (AI) as a machine-based system that operates with varying levels of autonomy and infers, from the input it receives, predictions, content and recommendations. Article 2 (7) of the AI Act refers to the GDPR for its provisions on the processing of personal data (2).

    To ensure that AI tools for use in patient treatment are accurate and safe, providers need large amounts of data to develop and train their products. In addition, the health services need data to validate and test the AI tools, or to adapt them to local populations (3).

    Synthetic datasets are now being produced for these purposes, and they can consist of images, sounds, tables and time series (2). Synthetic data can be produced on the basis of real patient data with the aid of generative AI models, and can consist of test results, radiological images and patient record notes that look real, even though they are not. There is widespread optimism in this area, particularly around the deep generative models that most of us now know from ChatGPT.
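    To illustrate the principle, and not the deep generative models themselves, the following sketch fits a very simple parametric generator to a hypothetical table of test results and samples new, artificial records. All column names and values are invented for illustration.

```python
# A minimal sketch, not a deep generative model: estimate the joint
# distribution of a hypothetical patient table and sample artificial
# records from it. All names and values are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Stand-in for a real patient dataset: two numeric test results.
real = pd.DataFrame({
    "systolic_bp": rng.normal(130, 15, size=500),
    "hba1c": rng.normal(6.0, 0.8, size=500),
})

# "Fit" the generator: estimate mean and covariance of the real data.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Sample new records that are statistically similar to the original
# but correspond to no real individual.
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=500),
    columns=real.columns,
)
print(synthetic.describe())
```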

    In theory, synthetic data cannot be linked directly to individuals, which allows them to be processed with fewer restrictions (4). The EU General Data Protection Regulation (GDPR) (5), which has been incorporated into Norwegian legislation, defines all information that can be linked to individuals as personal data. All personal data used for the development of AI, including those used to produce synthetic data, are therefore regulated by the GDPR. Other relevant provisions include the duty to ensure the right to privacy and confidentiality, set out in Section 102 of the Norwegian Constitution and Section 3-6 of the Patient and User Rights Act (6).

    Moreover, AI models for the health services must be trustworthy and ensure patient safety, for example by making the correct diagnosis or proposing an effective treatment. These are arguments in favour of permitting access to personal data and health information.

    However, access to personal and health data is restricted by legal provisions as well as by scarcity of the data themselves. The lack of health data tends to be most conspicuous for rare conditions and illness in children. For these, synthetic data may represent the best, as well as the fastest, opportunity to obtain sufficient data for the development of new AI models (7).

    The duty of confidentiality

    The purpose of the duty of patient confidentiality in the health services is to allow individuals to seek health assistance confident that their personal information will not be divulged or made available to unauthorised parties (8). Public health registries could serve as a basis for generating synthetic data that pose far less of a challenge to the duty of confidentiality and to trust in the health services. Considerable amounts of health data are currently stored in the national medical quality registries and other health registries. Registry data containing patient information can be transferred for secondary purposes as defined by the Health Registries Act, and these data are also subject to the duty of confidentiality (9).

    Data that are encompassed by the duty of confidentiality can be exempted from this duty and put to use if it is unlikely that they can be linked to the person in question, and if the benefits and risks of this are proportionate. Synthetic data that are not related to individuals are not personal data and hence not subject to the duty of confidentiality.

    To enable assessment of the identification risk and of risk-mitigating measures, there must be transparency about the AI model and the data that have been used (10, 11).

    Risks and challenges

    Synthetic datasets must be representative of the population on which they are intended to be used. Quality control is required throughout the entire synthetisation process: from control of the original dataset, through measurement of statistical similarity between the training dataset and the synthetic dataset, to testing the performance of a model trained on the synthetic data (11). Otherwise, the AI models can introduce new risks for patients when used in diagnostics and treatment.
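    One step in such quality control can be illustrated as follows: comparing the marginal distribution of each variable in the real and synthetic datasets, here with a two-sample Kolmogorov-Smirnov test on hypothetical stand-in data.

```python
# Quality-control sketch: compare each column's distribution in the real
# and synthetic data with a two-sample Kolmogorov-Smirnov test.
# The data are hypothetical stand-ins.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)
real = pd.DataFrame({"systolic_bp": rng.normal(130, 15, 500),
                     "hba1c": rng.normal(6.0, 0.8, 500)})
synthetic = pd.DataFrame({"systolic_bp": rng.normal(131, 16, 500),
                          "hba1c": rng.normal(6.1, 0.9, 500)})

for column in real.columns:
    stat, p_value = ks_2samp(real[column], synthetic[column])
    # A large KS statistic (small p-value) flags a synthetic column
    # that has drifted from the original distribution.
    print(f"{column}: KS statistic={stat:.3f}, p={p_value:.3f}")
```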

    The use of generative AI models to produce synthetic data is resource-intensive and energy-consuming, and this has consequences in terms of the environment and sustainability (12).

    Biases in the training dataset can inadvertently be reinforced through the synthetisation process. Generative methods that fail to capture underrepresented minority groups in the original data may lead to discrimination against individuals and patient groups, because the algorithms in the AI models are insufficiently accurate and reliable for specific sub-groups (13). On the other hand, synthetisation may be used to counteract discrimination, for example by correcting for original biases in the generative process, as sketched below.
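    One simple way to correct for an original bias, sketched here on hypothetical data, is to rebalance the training set before the generative model is fitted, so that the generator does not merely reproduce the original imbalance.

```python
# Bias-correction sketch: oversample an underrepresented subgroup before
# fitting the generative model, so the generator does not simply
# reproduce the original imbalance. Groups and sizes are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
training = pd.DataFrame({
    "group": ["majority"] * 950 + ["minority"] * 50,
    "lab_value": np.concatenate([rng.normal(5.0, 1.0, 950),
                                 rng.normal(7.0, 1.2, 50)]),
})

# Resample every group up to the size of the largest group.
target = training["group"].value_counts().max()
balanced = pd.concat([
    g.sample(target, replace=True, random_state=1)
    for _, g in training.groupby("group")
])
print(balanced["group"].value_counts())
```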

    Protecting personal data

    The risk of identifying individuals can be minimised through the use of synthetic data, and the opportunity for freer use and sharing can be extremely beneficial for the health services. However, when generative AI models are optimised to produce synthetic data with the greatest possible similarity to the original dataset, a risk of identification may remain. Some models can produce exact copies of parts of the original dataset, or data points that are materially identical, even when there is no one-to-one relationship between the real personal data in the training dataset and the synthetic data points. This is referred to as the residual risk of identification in the synthetic dataset (11), and this residual risk increases with knowledge of the generative methods (14).
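    A common residual-risk check, sketched here on stand-in data, is to compute the distance from every synthetic record to its closest real record; exact or near-zero distances reveal memorised copies of real individuals.

```python
# Residual-risk sketch: for every synthetic record, find the Euclidean
# distance to its closest real record. Near-zero distances indicate
# memorised copies of real individuals. Data are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=2)
real = rng.normal(size=(500, 3))        # stand-in real records
synthetic = rng.normal(size=(500, 3))   # stand-in synthetic records
synthetic[0] = real[10]                 # a planted exact copy

diffs = synthetic[:, None, :] - real[None, :, :]
min_dist = np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
print("records that copy a real individual:", int((min_dist < 1e-9).sum()))
```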

    A pivotal question in the generation of synthetic data is where to draw the line between when the data are to be considered personal data and when they are not, the latter allowing processing outside the material scope of the GDPR and the duty of confidentiality. After all, a key objective of synthetic data generation is to reduce the data privacy risk in order to facilitate processing. The legal threshold for identification risk is based on Article 4 (1) and recital 26 of the GDPR and various other legal sources (15–17). The likelihood of identification of persons is a key issue.

    Unfortunately, it may be possible to infer personal data from synthetic datasets (18). The possibility of identification will vary depending on the complexity of the dataset (the number of variables and patients), statistical outliers and the generative method used, as well as on access to other relevant information. The risk of identification also depends on how resource-intensive the identification process is, and on whether safety precautions have been applied during the generative process. Statistical outliers deserve particular attention, as sketched below.
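    The role of statistical outliers can be illustrated with a simple screening step on hypothetical data: records far from the centre of the distribution are rare, and therefore the ones most plausibly traceable to a single real person.

```python
# Outlier-screening sketch: flag synthetic records far from the centre
# of the distribution; rare combinations of values are the ones most
# plausibly traceable to a single real person. Data are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=4)
synthetic = rng.normal(0, 1, size=(1000, 4))   # stand-in synthetic records

z = np.abs((synthetic - synthetic.mean(axis=0)) / synthetic.std(axis=0))
outliers = (z > 3).any(axis=1)
print(f"{int(outliers.sum())} of {len(synthetic)} records flagged as outliers")
```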

    Safety precautions intended to prevent identification from synthetic data may increase the risk that the dataset is no longer representative, which in turn may reduce the adequacy and utility of the data and of the AI tool (19). The sketch below illustrates this trade-off.
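    Laplace noise of the kind used in differential privacy protects individual records, but increasing the noise distorts the statistics a downstream AI tool would learn from. The parameters here are illustrative, not a calibrated privacy mechanism.

```python
# Privacy-utility sketch: stronger Laplace noise gives more protection
# but pulls the data further from the original distribution.
# Parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(seed=3)
values = rng.normal(130, 15, size=1000)   # hypothetical measurements

for scale in (0.1, 5.0, 50.0):            # more noise = more protection
    noisy = values + rng.laplace(0.0, scale, size=values.size)
    print(f"noise scale {scale:5.1f}: mean={noisy.mean():7.2f}, "
          f"std={noisy.std():6.2f}")
```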

    Ethical concerns

    Ethical concerns and principles form the basis for parts of the legislation and can be included in legal deliberations, for example in assessments of reliability. Consequentialist ethics assumes that a decision is ethical if it optimises the positive consequences as a whole (20, 21). This implies that decisions must be assessed holistically before an opinion can be formed about their ethical tenability: the totality of positive effects must be balanced against the total harm that an action causes (21). From the perspective of consequentialist ethics, the social utility of introducing AI tools in the health services could justify a higher risk of identifying individuals. A further consideration relates to the utility that the data may have for future generations; consequentialists ascribe considerably higher importance to future generations (22). This contrasts with the world view of classical economics, where future utility has lower value than current utility, so-called discounting of future benefits.

    From a duty ethics perspective, duties toward future generations may also have a bearing. In a legal context, such perspectives can be encompassed by social considerations. Theories of duty ethics differ from consequentialist ethics by attaching more importance to the value, autonomy and rights of the individual than to social utility as a whole (21). A framework of duty ethics stipulates a positive duty to help others and a negative duty not to inflict harm. In the case of AI models, utility must be viewed more broadly than concern for the individual, since such models may benefit many people, including future generations. Social attitudes, and the risk people are willing to accept to achieve a collective good, may also be relevant considerations. This acceptance of risk is expressed, for example, in surveys on the willingness to make personal data available for research purposes in order to benefit others (23, 24).

    For synthetic data to help address the challenge of access to health data for the development of AI for the benefit of our patients, it is presumed that these data can be processed either as anonymous data or under a legal exemption. If the data are treated as personal information, their use will be restricted and their utility greatly reduced. The fact that the data are synthetic could in itself be regarded as a risk-mitigating measure. Access to large amounts of data through the use of synthetic datasets could also make the health sector less dependent on global technology companies that possess vast amounts of locked proprietary data for use in the development of AI.

    Special regulations for synthetic data in the health sector, as well as clarity in the government requirements for approval of equipment trained on such data, could provide more predictability for developers and producers of AI models, and also for users and patients in the health sector.

    Patients already need to accept a certain level of risk when benefitting from health services. A topic that has so far received insufficient attention is the level of data protection risk we as a society believe individuals should accept as a consequence of these data being used as a basis for improving access to modern medical treatment in the future, for the benefit of society in general.
