Background: Colorectal cancer (CRC) is the third most common cancer worldwide. Cancer discrimination is a typical application of microarray-based gene expression analysis. However, microarray data suffer from the curse of dimensionality and from a typically imbalanced class distribution between the majority (tumor samples) and minority (normal samples) classes. Feature gene selection is therefore necessary for cancer discrimination.
Objectives: To select feature genes for the discrimination of CRC.
Methods: We improve DEFSw, a feature selection algorithm based on differential evolution, by replacing the common classifier and evaluation measure with the RUSBoost classifier and weighted accuracy, so that feature genes can be selected from imbalanced data. We first extract differentially expressed genes (DEGs) from the CRC dataset of The Cancer Genome Atlas (TCGA) and then select feature genes from the DEGs using the improved DEFSw algorithm. Finally, we validate the selected feature gene sets on independent datasets and retrieve cancer-related information for these genes by text mining through the Coremine Medical online database.
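To illustrate why a weighted evaluation measure matters for imbalanced data, the following minimal sketch compares plain accuracy with a weighted accuracy on a toy label set. The weighting formula (w times sensitivity plus (1 - w) times specificity), the function name `weighted_accuracy`, and the weight value are assumptions made for illustration; they are not taken from the DEFSw algorithm itself.

```python
def weighted_accuracy(y_true, y_pred, w=0.5, positive=1):
    """Assumed form: w * sensitivity + (1 - w) * specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return w * sensitivity + (1 - w) * specificity


# Toy imbalanced data: 8 tumor samples (1) and 2 normal samples (0).
y_true = [1] * 8 + [0] * 2
# A trivial classifier that labels every sample as tumor.
y_pred = [1] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
weighted = weighted_accuracy(y_true, y_pred)
print(plain, weighted)
```

Under these assumptions, the trivial classifier scores well on plain accuracy but is penalized by the weighted measure because it misclassifies every minority (normal) sample.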
Results: We select 16 single-gene feature sets for colorectal cancer discrimination and 19 single-gene feature sets for colon cancer discrimination only.
Conclusions: We identify a series of high-potential candidate biomarkers or signatures that can discriminate colon cancer, rectal cancer, or both with high sensitivity and specificity.
Keywords: Colorectal cancer, feature gene selection, cancer discrimination, imbalanced data