Data Curation Challenges for Artificial Intelligence
Ken Chang, Mishka Gidwani, Jay B. Patel, Matthew D. Li, and Jayashree Kalpathy-Cramer
The last decade marks a significant leap in the capabilities of artificial intelligence (AI) algorithms, driven by the advent of powerful processing units, open-source frameworks for deep learning, and the availability of large-scale datasets [1-3]. These algorithms are capable of learning the output(s) of interest from raw or preprocessed data for a variety of tasks such as image classification, speech recognition, and natural language processing [4-6]. More specifically, this state-of-the-art performance is accomplished by chaining together many layers of high-dimensional, non-linear transforms (e.g. convolutions, activations, and pooling), which together form a network that can learn complex patterns with a high degree of abstraction. Within the medical domain, a logical application of this technique is medical imaging, where clinicians have long noticed the relationship between imaging patterns and diagnosis, prognosis, genomics, and treatment response. As such, AI has brought on a paradigm shift in automated methods for medical image processing, with recent studies showing its potential utility for clinical applications within dermatology, ophthalmology, pathology, oncology, cardiology, infectious disease, radiation oncology, and radiology [7-12].
Although there are numerous possible tasks for AI algorithms within medical imaging, they generally fall into three main categories: classification, detection, and segmentation (Figure 14.1). Perhaps the most common task is segmentation, which may be considered the voxel or pixel-wise classification of anatomical areas into categories. In medicine, this is especially useful when the regions of abnormalities, tissues, and organs need to be precisely delineated, such as in the case of disease burden quantification, treatment response assessment, and radiation therapy planning [13-16].
FIGURE 14.1 Within medical imaging, there are three general categories of tasks: classification, detection, and segmentation.
Segmentation, also referred to as contouring, is a critical step in the development of a radiation therapy treatment plan. Members of the radiology and radiation oncology team, including technicians, dosimetrists, and radiation oncologists, frequently delineate normal structures and lesions in a manual or semi-automatic manner. These structures in turn inform the creation of a dose map, which is intended to deliver the maximal dose to target sites while sparing normal tissue. This highlights the importance of accurate segmentation of these structures. Additionally, radiation therapy is frequently informed by characteristics of tumors which can be derived from the gross tumor volume (GTV) contour. These include the total volume (in voxels or milliliters), the maximal diameter (in pixels or centimeters), and the number of discontinuous lesions.
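As a minimal sketch of how such characteristics might be read off a contour, the Python example below summarizes a binary GTV mask. The function name `gtv_summary`, the `(dz, dy, dx)` spacing convention, and the choice of 6-connectivity for separating lesions are assumptions made for illustration; the maximal diameter is omitted for brevity.

```python
from collections import deque

import numpy as np


def gtv_summary(mask, spacing_mm):
    """Summarize a binary 3D tumor mask (hypothetical helper).

    mask: 3D array, nonzero voxels belong to the GTV.
    spacing_mm: voxel size as (dz, dy, dx) in millimeters (assumed order).
    """
    mask = np.asarray(mask).astype(bool)
    n_voxels = int(mask.sum())
    # 1 mL equals 1000 cubic millimeters.
    voxel_ml = float(np.prod(spacing_mm)) / 1000.0

    # Count discontinuous lesions as 6-connected components (an assumption)
    # using a breadth-first flood fill from every unvisited tumor voxel.
    visited = np.zeros(mask.shape, dtype=bool)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    n_lesions = 0
    for seed in zip(*np.nonzero(mask)):
        if visited[seed]:
            continue
        n_lesions += 1
        visited[seed] = True
        queue = deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in offsets:
                nb = (z + dz, y + dy, x + dx)
                inside = all(0 <= c < s for c, s in zip(nb, mask.shape))
                if inside and mask[nb] and not visited[nb]:
                    visited[nb] = True
                    queue.append(nb)

    return {
        "voxels": n_voxels,
        "volume_ml": n_voxels * voxel_ml,
        "lesions": n_lesions,
    }
```

Reporting the volume in milliliters rather than voxels requires the voxel spacing, which is one reason acquisition metadata should be kept alongside the contour.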
Advances in deep learning techniques for segmentation have been catalyzed in part by large-scale competitions [18-21], which provide a common framework for comparison and evaluation. Specific advances, such as the U-net architecture, ensembling, and adaptable input sizes, have been shown to be particularly effective, elevating the performance of segmentation to the levels of inter-rater variability expected of human experts (Figure 14.2). Progress in methods development has also been expedited by the availability of open-source code and software [25-27].
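Ensembling in its simplest form, averaging the outputs of several models as described for Figure 14.2, can be sketched as follows. The function name `ensemble_average` and the 0.5 threshold are illustrative assumptions, and each model is assumed to output a per-pixel foreground probability map of identical shape.

```python
import numpy as np


def ensemble_average(prob_maps, threshold=0.5):
    """Average per-pixel probabilities from several models (illustrative).

    prob_maps: sequence of arrays with identical shape, values in [0, 1].
    threshold: cutoff for the final binary mask (0.5 is an assumption).
    Returns (mean probability map, binary mask).
    """
    stacked = np.stack([np.asarray(p, dtype=float) for p in prob_maps], axis=0)
    mean_prob = stacked.mean(axis=0)
    return mean_prob, mean_prob >= threshold
```

Averaging probabilities before thresholding, rather than voting on already-thresholded masks, keeps each model's confidence in the final prediction; either choice is a design decision rather than a requirement.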
The Complexity of Medical Imaging
FIGURE 14.2 (A) U-net architecture for image segmentation. The network is composed of an encoding arm, which repeatedly downsamples the input into a lower dimensional space, and a decoding arm, which recovers the output of interest. The skip connections between the encoder and decoder allow for feature re-use, which helps the network combine low-level with high-level information. (B) A schematic showing model ensembling across n models. The outputs of the individual models are averaged together to create a final, more refined prediction. (C) An example of a segmentation network that can accept variable input sizes, which can be useful for learning features at different scales.
It is an understatement to say that medical imaging is complex. Although medical imaging is commonly stored in the Digital Imaging and Communications in Medicine (DICOM) format, which harmonizes meta-information and images, there are still many sources of variation. Firstly, imaging can come in various resolutions and bit-depths. While regulatory agencies and device manufacturers recommend imaging parameters for stability and safety, many imaging settings are determined by the anatomy being imaged and the operator acquiring the image, to say nothing of environmental factors. Secondly, imaging can come in multiple views, with the possibility of supplemental views in scenarios where the imaging contains artifacts, ambiguity, or suspicious pathology. This can complicate the harmonization of segmentation labels, which may be superimposed on a single view. Imaging can also be acquired in a 2D format, such as x-ray or histopathology, a 3D format, such as computed tomography (CT), magnetic resonance imaging (MRI), or ultrasound, or even a 4D format, such as free-breathing CT (4DCT) and cine cardiac MRI. Furthermore, images can be stored and displayed as either grayscale (such as radiographs and CT) or color scale (such as positron emission tomography and Doppler ultrasound), depending on modality. Finally, the native dimension of the images directly influences the construction of a deep-learning algorithm. High-resolution multimodal 3D imaging must be treated differently from low-resolution unimodal 2D data, primarily due to graphics processing unit (GPU) memory concerns. Even when considering only 3D imaging formats, one must specifically design networks to handle differences between isotropic and non-isotropic data. Moreover, other factors such as convolutional filter size and filter numbers must change accordingly with image dimension. Operations that decrease the dimension of the image, such as pooling layers, need to be carefully tracked to ensure that spatial information is preserved when expected.
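One way to keep track of what pooling does to a volume is to follow the array shape and the physical voxel spacing through each downsampling step. The sketch below assumes non-overlapping pooling whose stride equals its kernel size and that rounds down; the helper name `track_pooling` and the example numbers are hypothetical.

```python
def track_pooling(shape, spacing_mm, pool_kernels):
    """Follow shape and voxel spacing through a series of pooling layers.

    shape: spatial size of the input, e.g. (slices, rows, columns).
    spacing_mm: physical voxel size along the same axes, in millimeters.
    pool_kernels: one kernel tuple per pooling layer; stride is assumed
        equal to the kernel, and output sizes are rounded down.
    Returns a list of (shape, spacing_mm) tuples, starting with the input.
    """
    shape, spacing_mm = tuple(shape), tuple(spacing_mm)
    stages = [(shape, spacing_mm)]
    for kernel in pool_kernels:
        shape = tuple(s // k for s, k in zip(shape, kernel))
        if min(shape) < 1:
            raise ValueError(f"pooling with {kernel} collapses an axis")
        # Each output voxel now covers k input voxels along that axis.
        spacing_mm = tuple(sp * k for sp, k in zip(spacing_mm, kernel))
        stages.append((shape, spacing_mm))
    return stages


# Non-isotropic example: thick 5 mm slices with 1 mm in-plane resolution.
# Pooling only in-plane at first moves the voxels toward isotropy.
for stage_shape, stage_spacing in track_pooling(
    (32, 256, 256), (5.0, 1.0, 1.0), [(1, 2, 2), (1, 2, 2), (2, 2, 2)]
):
    print(stage_shape, stage_spacing)
```

Pooling in-plane only for the first layers is one option for thick-slice data, not a prescription; the point is that the through-plane axis runs out of voxels far sooner than the in-plane axes if all three are pooled equally.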
Image intensity values can have units with physical meaning (such as the Hounsfield unit in CT) or be unit-less, as with conventional MRI. Studies can also have multiple sequences that need to be interpreted in combination, such as diffusion weighted imaging (DWI) and apparent diffusion coefficient (ADC) MRI sequences. Because of this, intensity values cannot be directly compared across sequences. For example, while an anatomical feature may be contrast-enhancing under certain imaging conditions (e.g. T1 post-contrast MRI), it may be more absorptive under different conditions (e.g. T2 MRI). When using a contrast-enhancing agent, a molecular imaging probe, or a fluorophore, the challenge of comparing images is compounded. Standardizing concentrations of reagents and imaging parameters such as field depth and exposure, as well as normalizing by the size of the targets of the imaging (e.g. number of cells for molecular imaging), can mitigate some of these differences. The diverse distributions of image intensity values call for standardization during preprocessing, as deep-learning models generally train more reliably when their inputs share a consistent intensity scale.
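Two common forms of intensity standardization are sketched below: clipping and rescaling CT values, which carry physical units, and z-scoring unit-less MRI intensities. The function names, the window of -1000 to 1000 Hounsfield units, and the optional foreground mask are illustrative assumptions rather than recommended settings.

```python
import numpy as np


def window_ct(hu, lo=-1000.0, hi=1000.0):
    """Clip CT values to [lo, hi] Hounsfield units and rescale to [0, 1].

    The default window is an arbitrary illustration; in practice it would
    be chosen for the anatomy of interest.
    """
    hu = np.asarray(hu, dtype=float)
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)


def zscore_mri(img, mask=None, eps=1e-8):
    """Shift and scale an MRI volume to zero mean and unit variance.

    mask: optional boolean array selecting the voxels (e.g. tissue rather
        than background air) used to estimate the mean and standard
        deviation. Statistics are computed per image, since MRI
        intensities are not comparable across scans.
    """
    img = np.asarray(img, dtype=float)
    voxels = img[mask] if mask is not None else img
    return (img - voxels.mean()) / (voxels.std() + eps)
```

Because sequences such as DWI and ADC are not on a shared scale, each sequence would typically be normalized separately before being stacked as input channels.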
The research setting often involves imaging across scales and model systems. The necessary technical differences when imaging cell culture, organoids, xenograft models, and humans can limit the ability of the researcher to compare images across scales. In the clinical setting, images from different modalities often need to be considered when making a diagnosis. Additionally, disease can present across multiple anatomical regions, such as primary and metastatic disease. Lastly, the longitudinal tracking of disease is often critical for evaluation, and current imaging must be compared with prior imaging. When multiple scans are required for the development of a machine learning model or for a granular view of the pathology of interest, the images acquired are not independent, and the effect of repeated imaging on the subject should be ascertained. This may involve movement of the subject, loss of contrast agent or fluorophore, or the molecular effects of ionizing radiation, among others. Unifying imaging information across scales, modalities, physiological sites, and time is one of the many challenges facing the data collector.
The Challenge of Generalizability and Data Heterogeneity
One critical hurdle that prevents the widespread utilization of deep learning models in clinical workflows is the lack of generalizability (or transferability) of trained models across datasets and institutions [30-33]. Indeed, this fact is masked within the medical literature, as very few published studies include external validation. In general, the lack of generalizability is attributed to data heterogeneity, that is, the divergence between the distribution of the data used for training and the distribution of the data used for evaluation.
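As a rough illustration of what "divergence between distributions" can mean in practice, the sketch below compares how a single categorical attribute (for instance scanner manufacturer or patient sex) is distributed in a training set and an evaluation set. The total variation distance is used here only as one simple, assumed choice of measure, and `category_shift` is a hypothetical helper name.

```python
from collections import Counter


def category_shift(train_values, eval_values):
    """Total variation distance between two categorical samples.

    Returns 0.0 when both sets have identical category proportions and
    1.0 when they share no categories at all.
    """
    p, q = Counter(train_values), Counter(eval_values)
    n_p, n_q = sum(p.values()), sum(q.values())
    if n_p == 0 or n_q == 0:
        raise ValueError("both samples must be non-empty")
    # Counter returns 0 for categories missing from one of the samples.
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))
```

A value near zero for every recorded attribute does not guarantee that a model will transfer, since image-level differences may not be captured in metadata, but a large value is an inexpensive early warning.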
Data heterogeneity can stem from several causes. Firstly, there may be differences in patient demographics (such as age, sex, and ethnicity) and disease prevalence between different institutions. The imaging acquisition can also vary. For example, different mammography systems can have different X-ray tube targets, filters, digital detector technology, and control of automatic exposure. Along the same lines, MRI at different institutions may utilize different field strengths, resolutions, and scanning protocols. In fact, MRI acquisitions can differ even within an institution if there are machines from multiple manufacturers. Lastly, there can be variability in the labeling by human annotators, a phenomenon that has been documented across many medical disciplines [36-38] (Figure 14.3).
FIGURE 14.3 Various types of heterogeneity can exist in real patient data, such as (A) imbalanced labels or patient characteristics, (B) different image acquisition setting or scanner systems, and (C) inter-rater variability in labeling.
The choice of a well-defined sample population is important when beginning data curation. In order to obtain primary medical data of a pathology of interest, an authorized user must access the electronic medical record (EMR) with the approval of an institutional review board (IRB). Because this user may not be the researcher, clear communication and a clear definition of the cohort of interest are required. The authorized user further needs to be able to search and harvest data from the EMR, which is complicated by heterogeneously tagged data and differing formats. When medical data is downloaded and saved, often in an anonymized fashion, relevant information may be lost, such as prior treatments, comorbidities, and genetic characteristics, all of which may influence data analysis. Importantly, the DICOM header may contain useful information, such as the resolution, orientation, and acquisition settings, that may be stripped away during anonymization. One other aspect of data selection that should be mentioned is concept drift, that is, how the classification of disease (and thus the relevant annotations) changes as knowledge of medicine evolves. In addition, there can be technology shift, where newer imaging systems replace older ones. It is important that data selection captures the most recent clinical standards in order to ensure that trained algorithms can be used prospectively. As such, data selection, curation, and annotation must be a continuous process. Broadening this statement, the training and testing data should be similar. If the training data significantly predates the testing data (or differs from it in other ways), the performance of the algorithm may degrade. Additionally, it is critical that the positive and negative cases are acquired under the same conditions. If not, the algorithm may learn differences in image acquisition instead of the disease of interest, achieving deceptively high performance.
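A simple check for the last point is to cross-tabulate labels against the acquisition condition before training. The sketch below assumes each case is described by a hypothetical `(condition, label)` pair, where the condition might be a scanner or site identifier and the label is 0 (negative) or 1 (positive); the helper names are illustrative.

```python
from collections import defaultdict


def label_by_condition(records):
    """Count negatives and positives per acquisition condition.

    records: iterable of (condition, label) pairs with label in {0, 1}.
    Returns {condition: [n_negative, n_positive]}.
    """
    table = defaultdict(lambda: [0, 0])
    for condition, label in records:
        table[condition][label] += 1
    return dict(table)


def confounded_conditions(table):
    """List conditions that contain only positives or only negatives.

    For such conditions a model could separate the classes by recognizing
    the acquisition setting rather than the disease.
    """
    return sorted(c for c, (neg, pos) in table.items() if neg == 0 or pos == 0)
```

An empty result does not rule out subtler imbalances, so the full table is still worth inspecting, but any condition flagged here deserves attention before the data is used for training or evaluation.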