Machine learning (ML) is a subfield of artificial intelligence that focuses on pattern recognition and computational learning. ML builds algorithms that learn from data and make predictions on it, so that decisions are driven by data rather than by static program instructions. With the new technology developed in our lab, we produce high-throughput, multi-dimensional data for the fingerprinting of cells (see RT-DC), and in collaboration with the Biomedical Cybernetics group at BIOTEC we are developing ad hoc procedures for ML analysis. The challenge is to make maximum use of this large volume of data, so we apply multiple ML techniques to address different data mining tasks. Depending on the usage scenario, ML can be broadly divided into unsupervised and supervised learning. The main difference is that unsupervised ML works without explicit label information, whereas supervised ML is trained with labels (such as healthy/diseased cell or positive/negative cell) in order to make predictions. With unsupervised methods such as principal component analysis (PCA), we can look for hidden patterns in the data, for example the unexpected presence of different subtypes of a pathological condition or hidden sub-populations that are otherwise hard to identify. For supervised ML, given for instance labels for the cell cycle phase (mitotic/non-mitotic), we use ensemble learning to train a model that learns a decision rule, and we then use that rule to classify new, unknown samples.
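To make the unsupervised step more concrete, the following is a minimal sketch of how a first principal component can expose two sub-populations without using any labels. It is not our analysis pipeline: the two features (a cell area and a deformation value), their numerical ranges, and the synthetic Gaussian data are assumptions chosen purely for illustration, and the leading component is found by plain power iteration so that the example needs only the Python standard library. In practice a numerical library would normally be used instead.

```python
import random
from statistics import mean, pstdev

def standardize(rows):
    """Z-score every feature column so that no feature dominates by scale."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def covariance(rows):
    """Covariance matrix of already centered data (one sample per row)."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200, seed=0):
    """Leading eigenvector of a symmetric matrix, found by power iteration."""
    rng = random.Random(seed)
    v = [rng.random() for _ in cov]  # random start avoids an unlucky orthogonal guess
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Synthetic stand-in for measured cells: two hypothetical sub-populations that
# differ in both assumed features (area, deformation). No labels are used below.
rng = random.Random(42)
small = [[rng.gauss(50, 3), rng.gauss(0.02, 0.004)] for _ in range(50)]
large = [[rng.gauss(80, 3), rng.gauss(0.06, 0.004)] for _ in range(50)]

data = standardize(small + large)
axis = first_component(covariance(data))

# Projecting each cell onto the first component gives one score per cell;
# the two sub-populations end up on opposite sides of zero.
scores = [sum(x * a for x, a in zip(row, axis)) for row in data]
print("first component:", [round(a, 3) for a in axis])
print("score range, first 50 cells:", round(min(scores[:50]), 2), round(max(scores[:50]), 2))
print("score range, last 50 cells:", round(min(scores[50:]), 2), round(max(scores[50:]), 2))
```

A histogram of `scores` would show two separated modes, which is the kind of hidden structure that would then be checked against independent evidence before being treated as a real sub-population.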
Despite the theoretical distinction between supervised and unsupervised ML methods, we combine them to address practical tasks. A typical workflow in our lab starts with unsupervised ML to explore the patterns hidden in the data and to validate the putative cell populations. Once the labels learned in the unsupervised step are confirmed, they are used to train a supervised procedure for the characterization of new, unknown cells. In the near future, we will also employ state-of-the-art deep learning techniques to carry out the analysis directly on the raw image data. Meanwhile, since the RT-DC output data contain multiple features, we are also using feature selection to determine which feature combination yields the most valuable information, e.g. the most distinct separation of the heterogeneous populations.
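The idea behind feature selection can be sketched as an exhaustive search over feature combinations, each scored by how well it separates two labeled populations. The example below is an illustration under stated assumptions, not the procedure used in our lab: the feature names (`area`, `deformation`, `brightness`), the hand-constructed toy data, and the scoring criterion (leave-one-out accuracy of a 1-nearest-neighbour classifier) are all hypothetical choices. The toy data are built so that no single feature separates the populations, while the pair of area and deformation does.

```python
from itertools import combinations
from statistics import mean, pstdev

def standardize(rows):
    """Z-score every feature column so that no feature dominates by scale."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, query in enumerate(rows):
        # Nearest other sample by squared Euclidean distance.
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(query, rows[j])),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def select_features(rows, labels, names):
    """Score every non-empty feature subset; return the best subset and all scores."""
    data = standardize(rows)
    scores = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(range(len(names)), size):
            reduced = [[row[k] for k in subset] for row in data]
            scores[tuple(names[k] for k in subset)] = loo_accuracy(reduced, labels)
    # max() keeps the first maximum it meets, so ties go to the smaller subset.
    best = max(scores, key=scores.get)
    return best, scores

# Toy data (hypothetical): populations "A" and "B" overlap in every single
# feature but lie on two parallel lines in the area/deformation plane.
names = ["area", "deformation", "brightness"]
brightness = [3, 1, 4, 1, 5, 9]  # identical in both populations, so uninformative
rows, labels = [], []
for x in range(6):
    rows.append([x, x + 1.5, brightness[x]])
    labels.append("A")
    rows.append([x, x - 1.5, brightness[x]])
    labels.append("B")

best, scores = select_features(rows, labels, names)
for subset, score in scores.items():
    print(subset, round(score, 2))
print("selected:", best)
```

Exhaustive search is only practical for a small number of features, because the number of subsets doubles with every feature added. For larger feature sets a greedy forward or backward search would be a common substitute.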