The Need for Large Quantities of Data

The large quantities of data needed to train effective, robust models are driven mainly by this issue of data heterogeneity. The requirement grows further in use cases with a high degree of anatomical and disease phenotypic variability. For general computer vision tasks, public datasets can be quite large (on the order of 10⁵–10⁹ samples) [2, 40, 41]. Conversely, publicly available medical imaging datasets are considerably smaller (on the order of 10¹–10⁵ samples), making the data requirement much more challenging to meet [20, 21, 42, 43]. Furthermore, using publicly available data can introduce additional complexity. For example, the data may come with incomplete annotations or may have already been processed via some unknown transformations (effectively equivalent to adding unwanted noise to the input data distribution). Indeed, various studies have shown that algorithm performance improves substantially with the incorporation of more training data [44, 45]. Importantly, the size of the dataset encompasses both the absolute quantity of data and the number of images from patients with the pathology of interest. For example, a very large dataset in which only a small percentage of patients have the disease may not be effective for training, as the algorithm may not be exposed to adequate phenotypic diversity. That said, curating matched “normal” data to serve as algorithm controls can likewise be limited by the infrequency of healthy patient visits as well as the high prevalence of abnormalities that are not clinically significant [46]. Moreover, enough data must be acquired not only to train the network but also to perform internal and external validation to assess model generalizability.

In cases where a minimum acceptable threshold of data cannot be curated, pre-training can be a powerful tool. In pre-training, a model is initially trained on a large, diverse dataset for a tangentially related task before being fine-tuned on the smaller dataset of the task of interest. In general, the early layers of a neural network learn generic, non-task-specific features (i.e. edge filters, shape detectors, color filters, etc.), and thus can be transferred to other tasks without modification. Indeed, it is common to fine-tune only the final layers of the network on the new task of interest, freezing the rest of the network to preserve the pre-trained weights. This process of

FIGURE 14.4 Examples of common spatial and intensity data augmentation transforms that can be applied to imaging data. It is common to compose these transformations together to generate large amounts of variation from a potentially limited training dataset.

utilizing large quantities of related data (also known as transfer learning) is an effective paradigm within medical imaging and has been shown to improve performance by allowing the network to learn domain-specific imaging features without needing to relearn generic filters [47, 48]. Pre-training can also be important in the face of concept or technology shift, whereby historical data can be used for transfer learning.
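The freeze-then-fine-tune idea can be sketched in a few lines. The following is a minimal, hypothetical illustration (not a production implementation): a tiny two-layer network in NumPy in which the "pre-trained" feature layer is held fixed and only the task-specific head receives a gradient update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pre-trained" weights: an early feature layer (to be frozen)
# and a final task-specific head (to be fine-tuned on the new task).
W_feat = rng.standard_normal((8, 4))
W_head = rng.standard_normal((4, 2))

# A small batch standing in for the new, smaller medical imaging dataset.
x = rng.standard_normal((16, 8))
y = rng.standard_normal((16, 2))

# One illustrative fine-tuning step (mean squared error), updating only the head.
h = np.maximum(x @ W_feat, 0.0)          # generic features from the frozen layer
pred = h @ W_head
grad_head = 2.0 * h.T @ (pred - y) / len(x)

W_feat_frozen = W_feat.copy()
W_head = W_head - 0.01 * grad_head       # only the final layer is updated
```

In deep learning frameworks the same effect is achieved by disabling gradient tracking on the frozen parameters; the principle is identical.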

Regardless of the absolute quantity of data available, training can still be improved with careful application of methods that artificially augment the dataset. These methods manipulate the existing data to increase its diversity, using random spatial transforms (translations, mirroring, scaling, rotations, elastic deformations) and intensity transforms (gamma corrections, saturation, noise) [24, 49] (Figure 14.4). Other modulations such as random jittering, kernel filters, and erasing can be applied as well [24, 50-52]. One counter-intuitive approach that has been shown to be effective is to mix, or average, two images together, which allows the generation of on the order of N² training images from N originals [53, 54]. More advanced techniques include using neural networks to learn optimal augmentation policies, altering images via style transfer, or generating entirely new training images [55-58]. While augmentation is generally expected to improve the generalizability of the model, this is not necessarily the case for medical imaging [49, 50]. Specifically, one needs to ensure that the chosen augmentations are physiologically plausible. For example, the heart contracts and relaxes in a very regular pattern, and applying random elastic deformations that are not carefully constrained can produce augmented images that lie outside the true data distribution, which may in fact lower performance. Data augmentation is explored further in Chapter 11.
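A few of the transforms above can be sketched directly in NumPy. This is a simplified illustration on random arrays standing in for normalized grayscale images; real pipelines typically compose many such transforms with random parameters per sample.

```python
import numpy as np

rng = np.random.default_rng(42)
img_a = rng.random((64, 64))   # stand-ins for two normalized grayscale images
img_b = rng.random((64, 64))

# Spatial transforms: mirroring and a 90-degree rotation.
flipped = np.fliplr(img_a)
rotated = np.rot90(img_a)

# Intensity transform: gamma correction (assumes intensities in [0, 1]).
gamma_corrected = img_a ** 0.8

# Mixing: a weighted average of two images yields a new synthetic sample,
# which is how N images can generate on the order of N^2 training samples.
lam = 0.5
mixed = lam * img_a + (1 - lam) * img_b
```

Note that for segmentation tasks the same spatial transform must be applied to the image and its annotation, while intensity transforms are applied to the image only.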

Barriers to Sharing Patient Data and Distributed Deep Learning

Another major hurdle in data curation is the difficulty of sharing patient data, specifically the concerns of patient privacy, data anonymization, patient consent, intellectual property, and data storage requirements. Firstly, protection of patient privacy and confidentiality is of critical importance both within medical care and in research. Indeed, studies have shown that the leakage of just a few clinical variables or a single image can allow for reidentification of patients [59-61]. One approach to prevent leakage of information is to convert DICOM images into other formats, such as Joint Photographic Experts Group (JPEG), Neuroimaging Informatics Technology Initiative (NIfTI), and Portable Network Graphics (PNG), which removes identifiable information present in the DICOM header [28]. In addition, other potentially identifiable information can be removed through various approaches, including defacing of imaging data, anonymization of clinical reports, and removal of patient health information imprinted into the image itself [62-64]. Despite these efforts at automation, information leakage remains possible and laborious manual audits are still needed. For example, identity can leak from accessories such as necklaces and wristbands [28]. In cases where no automated methods for patient de-identification exist, manual auditing and anonymization can be prohibitively expensive, especially for large-scale datasets. Additionally, depending on the institution and its IRB, patient consent may be needed to share the data, which adds yet another barrier [65]. Furthermore, data may be regarded as a valuable commodity, and institutions may simply prefer not to share patient data with external groups due to organizational interests. Lastly, there may be a high cost of data storage, especially given the increasing utilization of high-resolution imaging with multiple modalities [66].

One alternative to sharing data is to train deep learning models via a distributed learning approach. Under this paradigm, the deep learning model weights, updates, or intermediate outputs are shared instead of the patient data. This alleviates the need for full data anonymization and eliminates the need for a secure central database. With this approach, each institution installs software that allows it to connect to other institutions for collaborative training. Techniques such as cyclical weight transfer, federated learning, and split learning have shown the potential to achieve performance comparable to sharing patient data [67-72]. Recently, proof-of-concept studies have demonstrated the utility of federated learning for brain tumor segmentation [73, 74].
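The aggregation step at the heart of federated learning can be illustrated with a toy federated averaging round. The weights, site names, and sample counts below are hypothetical; in practice each site trains locally for one or more epochs before the server averages.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical local model weights after one training round at three institutions.
local_weights = [rng.standard_normal(5) for _ in range(3)]
sample_counts = np.array([100, 250, 50])   # images held at each site

# Federated averaging: the central server combines the weights, weighted by
# each site's dataset size; no patient images ever leave the institutions.
coeffs = sample_counts / sample_counts.sum()
global_weights = sum(c * w for c, w in zip(coeffs, local_weights))
```

The weighted average ensures that sites with more data contribute proportionally more to the shared model, while only parameter vectors, never images, cross institutional boundaries.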

Data Quality Issues

Data quality issues are detrimental to all medical imaging applications. These issues can include motion artifact, reconstruction artifact, low signal, noise, out-of-focus imaging, low resolution, operator error, technique limitations, and physical artifacts [75-78]. Additionally, imaging formats that are not, or historically have not been, natively digital, such as pathology slides or X-ray film, can introduce batch differences that the researcher must address [79]. The use of an imaging standard, commonly called a “phantom”, can help differentiate issues in image quality stemming from the equipment and protocol from those arising from the subject itself. Registering an acquired image to a template of fixed anatomy can help identify deformations such as rotation, blurring, or shearing.

As data is curated at scale, there will inevitably be low-quality images within the dataset. Just as a radiation oncologist cannot perform treatment planning on a severely motion-corrupted image, effective deep learning algorithms cannot be trained on low-quality images. Thus, it is of vital importance to identify, and if possible correct, low-quality samples in the dataset. Identification of such samples can be done via out-of-distribution detection methods, which aim to select samples that by some metric are classified as outliers [80]. Another solution is to utilize novel deep learning uncertainty approaches, which aim to flag samples in which the algorithm is not confident [81-85].
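As a minimal, hypothetical example of metric-based outlier screening, one can compute a simple summary statistic per image (here, mean intensity) and flag samples whose z-score falls far outside the distribution of the rest of the dataset. Real out-of-distribution detectors are considerably more sophisticated, but the principle is the same.

```python
import numpy as np

rng = np.random.default_rng(1)

# Mean intensity of each image in a hypothetical curated dataset, plus one
# corrupted (e.g. severely motion-degraded) sample appended at the end.
mean_intensity = np.concatenate([rng.normal(0.5, 0.02, size=99), [0.95]])

# Flag samples whose summary statistic lies far outside the dataset distribution.
z = (mean_intensity - mean_intensity.mean()) / mean_intensity.std()
outlier_indices = np.flatnonzero(np.abs(z) > 3)
```

Flagged samples can then be routed to manual review rather than silently included in training.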

After identification, correction should be applied when possible. For example, low-resolution images and annotations should be appropriately resampled via spline interpolation (ensuring that annotations are resampled only using a spline of order 0). More modern approaches to this problem involve domain-specific content-aware resampling (e.g. a neural network trained specifically to upsample low-resolution brain images to a higher resolution) [86]. Such methods come with the same caveats of model brittleness and lack of generalizability mentioned previously, and so should be used with caution. Other algorithms are capable of removing burned-in text, such as that from an ultrasound [64].
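The order-0 caveat can be made concrete with `scipy.ndimage.zoom`, assuming SciPy is available. The arrays below are synthetic stand-ins: the image is resampled with a cubic spline, while the label map uses nearest-neighbor interpolation so that only valid label values survive.

```python
import numpy as np
from scipy.ndimage import zoom

rng = np.random.default_rng(2)
image = rng.random((32, 32))                    # low-resolution image
labels = rng.integers(0, 3, size=(32, 32))      # matching annotation (labels 0-2)

# Upsample the image with a cubic spline (order 3)...
image_hr = zoom(image, 2, order=3)

# ...but the annotation with order 0 (nearest neighbor), so interpolation
# cannot invent spurious intermediate label values such as 0.5 or 1.7.
labels_hr = zoom(labels, 2, order=0)
```

Resampling a label map with a higher-order spline would blend adjacent classes into meaningless fractional values, which is why the order-0 constraint matters.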

Overall, numerous algorithms have been developed to correct artifacts or improve the resolution of images, and it is the duty of the researcher to identify which of these algorithms or methods will work best on their data [87, 88]. If all else fails, manual inspection of the images may be needed to remove problematic samples from the dataset.

Data Annotations

One important part of training a supervised deep learning algorithm is the need for ground truth labels. While the preferred ground truth would be manual expert annotations, these can be difficult to acquire at scale. Specifically, labeling should be performed by trained clinicians with domain expertise and experience (which can include sub-specialty training and years of specialty practice), making it time-consuming and expensive. Unsurprisingly, studies have shown that annotators with more domain experience label more accurately than those with less experience [89-91]. A major challenge to manual expert annotation is inherent human variability [36-38]. Even under highly controlled settings with well-defined annotation criteria, there will still be variable distributions of class frequencies across users, with some experts being more conservative and others more liberal in their annotations (known as inter-rater variability) [28]. Furthermore, certain annotators may even exhibit poor self-consistency (known as intra-rater variability). Both inter-rater and intra-rater variability can weaken the ground truth labels, negatively affecting training due to the added noise. Additionally, these forms of variability can affect the generalizability of the model, since the “ground truth” in the external validation set may be produced differently than the “ground truth” used for training.

Another approach is to use natural language processing algorithms to extract labels from clinical reports. Studies have shown that natural language processing allows for accurate, high-throughput extraction of labels from unstructured narrative reports [92-95]. A further alternative for high-throughput annotation is citizen science and crowd-sourcing [96-99]. By decreasing the annotation burden on any individual, this approach is scalable. To ensure high-quality annotations, consensus and verification approaches can be utilized [100].
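At its simplest, report-based label extraction can be rule-based. The sketch below is a deliberately minimal, hypothetical example using regular expressions with crude negation handling; the cited NLP systems are far more sophisticated, but the input/output shape is the same: free text in, binary label out.

```python
import re

# Crude negation handling: "no ... pneumothorax" within one sentence.
NEGATION = re.compile(r"\b(no|without|negative for)\b[^.]*\bpneumothorax\b", re.I)
MENTION = re.compile(r"\bpneumothorax\b", re.I)

def extract_pneumothorax_label(report):
    """Return 1 if pneumothorax is affirmed, 0 if absent or negated."""
    if NEGATION.search(report):
        return 0
    return 1 if MENTION.search(report) else 0

reports = [
    "Small right apical pneumothorax is present.",
    "No pneumothorax or pleural effusion.",
    "Lungs are clear bilaterally.",
]
labels = [extract_pneumothorax_label(r) for r in reports]   # [1, 0, 0]
```

Even simple rules like these can label thousands of reports per second, which is why report-derived labels are attractive despite their imperfections.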

As an alternative to requiring high-quality annotations for the entire dataset, methods under the umbrella of weakly supervised learning may also be considered [101]. Weakly supervised learning reduces the annotation burden by combining information learned from gold-standard labels with that of unlabeled or weakly labeled data during the training process [101]. There are three major types of weak supervision: incomplete, inexact, and inaccurate. Incomplete supervision is a scenario in which only a small subset of the entire dataset is labeled [101]. An example of this is active learning, where the algorithm suggests which images, if labeled, would be most informative for training the neural network. Ideally, this identifies highly difficult or representative cases, either of which helps the algorithm learn [102]. Another example is semi-supervised learning, where the algorithm exploits unlabeled data by learning from the small labeled subset under the assumption that the labeled and unlabeled data come from the same distribution [101, 103]. The second type is inexact supervision, in which a coarse label is used for a more granular task. For example, image-level annotations can be utilized in a weakly supervised algorithm to produce pixel-level predictions [9, 104]. The third type is inaccurate supervision, in which there are non-negligible errors in the labels [105]. For example, labels automatically extracted from clinical reports using imperfect algorithms can still be used to train high-performing algorithms [106, 107].
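One common semi-supervised strategy, pseudo-labeling, can be sketched with a toy nearest-centroid classifier. The 2-D feature vectors and cluster parameters below are synthetic and purely illustrative: a model fit on the small labeled subset assigns provisional labels to the unlabeled pool, which can then be folded into further training.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 2-D feature vectors: a small labeled subset and a larger
# unlabeled pool assumed to come from the same two-cluster distribution.
labeled_x = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
labeled_y = np.array([0, 0, 1, 1])
unlabeled_x = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
                         rng.normal(5.0, 0.3, (20, 2))])

# Pseudo-labeling: fit class centroids on the labeled data, then assign
# provisional labels to the unlabeled pool by nearest centroid.
centroids = np.stack([labeled_x[labeled_y == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(unlabeled_x[:, None, :] - centroids[None], axis=2)
pseudo_labels = dists.argmin(axis=1)
```

The same-distribution assumption noted above is what makes this safe here; if the unlabeled pool came from a shifted distribution, the pseudo-labels would inject exactly the kind of label noise that inaccurate supervision must then contend with.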
