Data Curation via Competitions
Competitions can also be an effective means of compiling multi-institutional and multi-national datasets. In this approach, multiple institutions work together to collect, prepare, anonymize, pool, and annotate the dataset [96, 108-114]. This is part of a broader effort to incorporate citizen science for annotation of experimental datasets [99, 115, 116]. There are several advantages to creating a dataset through a competition. First, the resulting dataset can be substantially larger than any dataset curated at a single institution. Second, the process is efficient because similar data curation workflows can be used across all institutions. Third, the final dataset is diverse in terms of patient populations and acquisition settings, which allows for robust training of algorithms and evaluation of their generalizability across these differences. Fourth, a competition framework allows for fair and direct comparison of the performance of different algorithms. Lastly, competitions facilitate collaboration via open datasets and open-source code with the shared goal of integrating tools into the clinical workflow and improving patient care. This collaboration is catalyzed by social media, online forums, blogs, and preprint servers, all of which share a culture of exchanging insight and experience. Notably, there has been cross-pollination among participants from a variety of backgrounds, including clinicians, trainees, computer scientists, engineers, and data scientists. Some recent examples of datasets curated using the competition framework include the RSNA Brain CT Hemorrhage, Pediatric Bone Age, and Pneumonia Detection Challenges, as well as the 2017 AAPM Thoracic Auto-segmentation Challenge used throughout this book.
Bias and Curation of Fair Data
While it is well documented that minorities are disproportionately underrepresented in clinical trials and population health profiling studies [117, 118], these inequities also extend to medical AI. Since AI models learn the characteristics of their training data, models trained on inequitable data will not generalize well to unseen or minority classes. A study of published AI systems built on publicly available X-ray data found worse performance on underrepresented genders when the model was not trained with a minimum gender balance. When designed fairly, AI systems can mitigate inherent bias for race, age, gender, sexual orientation, and socioeconomic status. Proposed methods of bias mitigation include a minimum quota for minority populations, subsampling of the data, careful consideration of the label definitions of “normal” and “abnormal”, and inclusion of minority populations in AI systems design. Another safeguard is the use of an external validation cohort: data gathered from a single institution represent only a sample of the patient population that institution serves and are therefore subject to geographic bias, which can span demographic strata. Finally, thorough ethical review during problem formulation, model development, and system deployment can increase awareness of bias even when it cannot be corrected (Figure 14.5).
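As an illustration of the subsampling idea mentioned above, the following is a minimal sketch rather than a recommended protocol. It caps every demographic group at a multiple of the smallest group's size so that no group dominates the training set. The function name `subsample_balanced`, the `max_ratio` parameter, and the record layout (a list of dictionaries with a `group` field) are assumptions made for this example and do not come from the cited studies.

```python
import random
from collections import defaultdict


def subsample_balanced(records, group_key, max_ratio=1.0, seed=0):
    """Downsample over-represented groups (hypothetical helper).

    Each group keeps at most max_ratio * (size of the smallest group)
    records, so max_ratio=1.0 yields equal group sizes.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible

    # Bucket the records by their demographic group label.
    by_group = defaultdict(list)
    for record in records:
        by_group[record[group_key]].append(record)
    if not by_group:
        return []

    smallest = min(len(members) for members in by_group.values())
    cap = int(max_ratio * smallest)

    sampled = []
    for members in by_group.values():
        if len(members) > cap:
            members = rng.sample(members, cap)  # sample without replacement
        sampled.extend(members)

    rng.shuffle(sampled)  # avoid leaving the records ordered by group
    return sampled


# Toy data: 90 records from group "A" and 10 from group "B".
toy = [{"id": i, "group": "A"} for i in range(90)]
toy += [{"id": 90 + i, "group": "B"} for i in range(10)]
balanced = subsample_balanced(toy, "group")
print(len(balanced))
```

Subsampling discards data from the larger groups, so in practice it is weighed against alternatives such as collecting more data from underrepresented populations.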
Overview of Data Curation Process
As an overview, there are several key steps in the data curation process. The first step is selection of pertinent data that are diverse and reflect the most recent clinical standards. Importantly, effort should be made to avoid biases from factors such as race, age, gender, sexual orientation, and socioeconomic status. The second step is obtaining appropriate approval to share patient data and ensuring patient privacy via anonymization. Alternatively, a distributed learning approach may be considered in scenarios where sharing patient data is difficult. The third step is handling data quality issues via correction of artifacts and removal of low-quality images. The last step is data annotation, with labels either delineated manually by experts, extracted from clinical reports, or crowd-sourced from competitions. Alternatively, weakly supervised approaches can be considered in scenarios where labels are incomplete, inexact, or inaccurate.
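The anonymization step can be sketched as follows, under strong simplifying assumptions. The example drops directly identifying fields from a metadata record and replaces the patient identifier with a keyed hash, so that studies from the same patient can still be linked. The field names, the `anonymize_record` helper, and the secret key are hypothetical. A keyed hash is pseudonymization rather than full anonymization, and real medical images also require attention to image headers, text burned into pixel data, and institutional and regulatory requirements.

```python
import hashlib
import hmac

# Hypothetical metadata fields treated as directly identifying.
IDENTIFYING_FIELDS = {"patient_id", "patient_name", "birth_date", "address"}


def pseudonymize(patient_id, secret_key):
    """Map a patient ID to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(secret_key, str(patient_id).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_record(record, secret_key):
    """Return a copy of the record without identifying fields.

    A pseudonymous ID is added so that multiple studies from the same
    patient remain linkable after the original ID is removed.
    """
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    cleaned["pseudo_id"] = pseudonymize(record["patient_id"], secret_key)
    return cleaned


# Toy record with made-up values.
record = {
    "patient_id": "12345",
    "patient_name": "Jane Doe",
    "birth_date": "1970-01-01",
    "modality": "CT",
}
print(anonymize_record(record, secret_key=b"replace-with-a-private-key"))
```

Keeping the key private to the contributing institution means a third party cannot recompute pseudonyms from known patient identifiers.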
FIGURE 14.5 (A) Distribution of mammographic breast density by race in the Digital Mammographic Imaging Screening Trial (DMIST), showing that the attributes of the digital mammograms differ across races. (B) Testing set performance of a deep learning model for mammographic breast density trained on DMIST, demonstrating that performance can differ across races. Bias in deep learning models can occur when populations are underrepresented in the training set.
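A simple way to surface the kind of disparity shown in panel (B) is to report test performance separately for each demographic group instead of a single pooled number. The sketch below assumes classification labels and uses plain accuracy. The helper names and the toy labels are invented for illustration and are unrelated to the DMIST results.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Compute accuracy separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def largest_gap(per_group):
    """Difference between the best and worst per-group accuracy."""
    return max(per_group.values()) - min(per_group.values())


# Toy labels and predictions for two hypothetical groups.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0]
groups = ["A", "A", "A", "B", "B", "B"]

per_group = accuracy_by_group(y_true, y_pred, groups)
print(per_group, largest_gap(per_group))
```

A large gap does not identify its cause, but it signals that the training set composition and label definitions deserve a closer look for the worse-performing group.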
In summary, deep learning has changed the way researchers approach medical imaging analysis. Deep learning has delivered remarkable performance, on the level of human experts, but it requires large, high-quality, and diverse datasets. The curation of such datasets is challenging, and factors such as the complexity of medical imaging data, patient privacy, data quality issues, and annotation must be considered. Automated anonymization methods and distributed learning are two different approaches to protecting patient privacy. Automated methods can also be used to flag and correct data quality issues. Collection of high-quality annotations can be time-consuming and expensive, but this burden has been partially addressed by the advent of natural language processing and weakly supervised learning algorithms. Lastly, competitions are a promising framework to facilitate collaboration in the construction of large, multi-institutional datasets. When beginning a medical imaging project, the clinician or researcher should bear in mind that the data corpus collected determines the quality and confidence of the results. In radiation oncology, which is an imaging-driven discipline, amassing the data may not be rate-limiting, but care should still be given to data curation techniques in order to train fair, highly performant, and generalizable deep learning algorithms.