Machine learning made easy

Supervised (require labels)

  • random forests (RF)
  • support vector machines (SVM)
  • k-nearest neighbors (kNN)

Unsupervised (label-free, clustering, feature extraction)

  • principal components analysis
  • k-means clustering
  • self-organizing maps

Reinforcement learning (a separate paradigm: an agent learns from reward signals rather than from labels)

Feature-based models and artificial neural networks (ANNs)

  • RF and SVM require the input features to be specified explicitly
  • convolutional neural networks (CNNs) and recurrent neural networks (RNNs) can learn features from the training data by themselves

Machine learning is now a term you cannot avoid. From autocorrect on cell phones to self-driving cars, facial recognition, personalized medicine, and precision agriculture, machine learning plays an important role everywhere. In this it resembles bioinformatics, an emerging discipline that almost no biology research lab can avoid. Why, then, do terms such as supervised training, cross-validation, and parameter optimization still sound mysterious? Probably because the field demands a strong background in statistics, computer science, and biology.

As the brightest jewel in the crown of bioinformatics, machine learning will become ever more common and convenient for researchers to use. Most of the time, machine learning is used to handle large datasets, predict patterns, and cluster groups. Supervised methods are frequently used for binary or multi-class classification of test instances, or for numerical prediction of trait values (regression), and they require explicitly defined labels. Unsupervised methods are label-free and are used primarily for clustering and feature extraction.

To start a project, it is important to figure out what kind of question is being asked: is it a classification, regression, clustering, or dimensionality reduction problem? Classification is a type of supervised learning in which the goal is to assign (i.e., classify) samples to one of several known categories. Clustering is a type of unsupervised learning in which the goal is to partition the data into groups composed of similar samples.

Once the benchmark dataset has been split into training data and test data, different models are fitted to the training data and their performance is evaluated on the test data. However, the data points that happen to fall into a single test set may not be representative, so a single score may not reflect the model's ability to generalize to unseen data. To combat this dependence on what is essentially an arbitrary split, we use a technique called cross-validation. To choose the best-performing machine learning model, multiple algorithms and their parameter combinations are compared using measures such as the area under the receiver operating characteristic curve (AUROC), the precision-recall curve (PRC), and the F-score.
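The comparison described above can be sketched with Scikit-Learn as follows. This is only a minimal illustration under stated assumptions: the dataset is synthetic (generated with make_classification), and the three models and their settings are arbitrary choices, not the ones used in any particular study. Each model is scored with 5-fold cross-validation on AUROC, average precision (a summary of the precision-recall curve), and the F-score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a labelled benchmark dataset (assumption, for illustration only).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Candidate models; SVM and kNN are distance-based, so their inputs are scaled first.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# 5 stratified folds: every sample is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["roc_auc", "average_precision", "f1"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the 5 folds.
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

for name, metrics in results.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```

Because every fold serves once as the test set, the averaged scores depend much less on one arbitrary split than a single train/test evaluation does.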

In a project on predicting highly similar duplicate genes (HSDs), two supervised models, Linear Regression and k-nearest neighbors (KNN), were used as classifiers to train on labelled data and predict which gene duplicates belong to HSDs. This was done with Scikit-Learn, an efficient machine learning library for Python. I first created a table containing the targets (dependent variables) and the features (independent variables). Duplicates satisfying the HSD patterns were labelled 1; all others were labelled 0. The data were then loaded with Pandas and processed by supervised learning estimators such as the Linear Regression and KNN models. In the selected benchmark dataset, 75% of the groups were used as training data (model fitting) and the rest as test data to evaluate prediction performance (X_train, X_test, y_train, y_test). Cross-validation (CV) was also used to evaluate model performance by randomly splitting the data into several (e.g., 5) equally sized groups (folds). By trying different supervised models, such as Ridge regression, Lasso regression, KNN, and support vector machines (SVM), I tuned and optimized the model to best predict HSDs in a given dataset (predicted data).
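A workflow of this shape might look like the sketch below. It is not the project's actual code: the table is randomly generated, the column names (identity, length_diff, is_hsd) are hypothetical, and the labelling rule is a toy rule invented so that the example runs. Only the overall steps (build a table, split 75/25, fit KNN, tune with 5-fold cross-validation) follow the description above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature table; names and values are made up for illustration.
rng = np.random.default_rng(0)
n_rows = 200
table = pd.DataFrame({
    "identity": rng.uniform(70, 100, n_rows),      # toy percent identity
    "length_diff": rng.integers(0, 30, n_rows),    # toy length difference
})
# Toy labelling rule (an assumption): 1 = treated as HSD, 0 = everything else.
table["is_hsd"] = (
    (table["identity"] >= 90) & (table["length_diff"] <= 10)
).astype(int)

X = table[["identity", "length_diff"]]  # features (independent variables)
y = table["is_hsd"]                     # target (dependent variable)

# 75% of the rows for model fitting, 25% held out for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Tune the number of neighbours with 5-fold cross-validation on the training set.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": [3, 5, 7]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_f1 = search.score(X_test, y_test)  # F-score on the held-out 25%
print("best n_neighbors:", best_k)
print("held-out F1:", round(test_f1, 3))
```

Other estimators can be swapped into the same pipeline and parameter grid to compare them on equal terms; the held-out test set is touched only once, after tuning.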

References:

  1. Li, R., Li, L., Xu, Y. and Yang, J., 2022. Machine learning meets omics: applications and perspectives. Briefings in Bioinformatics, 23(1), p.bbab460.
  2. Mahood, E.H., Kruse, L.H. and Moghe, G.D., 2020. Machine learning: A powerful tool for gene function prediction in plants. Applications in Plant Sciences, 8(7), p.e11376.
  3. Soltis, P.S., Nelson, G., Zare, A. and Meineke, E.K., 2020. Plants meet machines: Prospects in machine learning for plant biology. Applications in Plant Sciences, 8(6).

<Last updated by Xi Zhang on Oct 16th, 2022>
