Evaluation of Feature Selection in Handling Missing Syndromes
We compare the two feature selection methods in terms of the size of the reduced-syndrome set they provide. Both boards show similar trends in the size of the extracted subset; the results for Board 1 and Board 2 are presented in Fig. 6.4. In Fig. 6.4, “Subset_M1” refers to the use of complete-case analysis to deal with missing syndromes during feature selection, while “Subset_M2” refers to the use of label imputation to address missing values. First, we can see that for both M1 and M2, as the missing ratio increases, the size of the extracted syndrome set first increases, then decreases, and eventually converges. One possible reason for this phenomenon is that feature selection is used to extract the set of most informative features for a given board. When the logs contain missing syndromes, some originally informative features may no longer provide useful information; feature selection may therefore have to include more alternative features in its extracted subset so that the subset can still give satisfactory diagnosis accuracy. However, when the missing ratio is too high, feature selection can no longer find appropriate alternative syndromes, which is consistent with the later decrease in subset size. Second, we can see that M2 preserves more syndromes than M1. This is because M2 applies label imputation to deliberately add extra information for missing syndromes, while M1 only discards missing syndromes.
Fig. 6.4 Subset size of two feature-selection-based methods. a Board 1. b Board 2
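The difference between the two ways of handling missing syndromes can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments. It assumes that complete-case analysis (M1) drops every log containing at least one missing syndrome, and that label imputation (M2) replaces each missing value with a dedicated placeholder label. The names `complete_case`, `label_impute`, and `MISSING_LABEL`, as well as the toy logs, are hypothetical.

```python
from typing import List, Optional

# One log is a row of syndrome values; None marks a missing syndrome.
Log = List[Optional[int]]

# Hypothetical placeholder label for a missing syndrome (an assumption,
# not a value taken from the source).
MISSING_LABEL = -1


def complete_case(logs: List[Log]) -> List[List[int]]:
    """M1-style handling (assumed): keep only logs with no missing syndrome."""
    return [list(row) for row in logs if all(v is not None for v in row)]


def label_impute(logs: List[Log], label: int = MISSING_LABEL) -> List[List[int]]:
    """M2-style handling (assumed): keep every log, replacing each missing
    value with a placeholder label so the entry still carries information."""
    return [[label if v is None else v for v in row] for row in logs]


# Toy data: four logs, three syndromes each.
logs = [
    [1, 0, 1],
    [None, 1, 0],
    [0, None, None],
    [1, 1, 0],
]

m1 = complete_case(logs)   # logs with missing entries are discarded
m2 = label_impute(logs)    # all logs are kept, missing entries are labeled

print(len(m1), len(m2))
```

Under these assumptions, the M2-style data retains every log and an explicit marker for each missing entry, whereas the M1-style data retains only the fully observed logs, so a subsequent feature selection step sees less information under M1.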