The huge data sets produced by high-throughput screening (HTS) technologies have created a tremendous challenge for the drug discovery industry. Rapid processing of HTS data and identification of hits are essential to accelerate the discovery of quality lead compounds. In addition to finding active compounds among those screened, it is useful to identify the molecular features associated with the activity. To do this, one needs to analyze the initial HTS data to find quantitative relationships between biological activity and specific compound features. There are several challenges in the development of biological activity models from HTS data. First, hit compounds belonging to different chemotypes may act via different mechanisms. Second, many HTS data sets have substantial measurement errors. Third, despite large exploratory sets that may include thousands of compounds, HTS programs usually yield relatively few active compounds. Powerful and flexible data management systems are key to addressing these challenges. In this review, we describe modern approaches to processing HTS data and developing biological activity models. In our opinion, such data management systems provide a functional interface between real and virtual screening programs. The synergy of these powerful technologies will increase the efficiency with which high-quality clinical candidates are produced, thus providing a great benefit to the industry.
Keywords: High-throughput screening, data mining, quality control, machine learning, visualization, chemogenomics