What is LiSSI?
LiSSI stands for Life-Style-Specific-Islands. It was
developed to identify islands mainly associated with a given
life-style.
LiSSI is divided into three sequential modules:
Optionally, the tool can be used without island detection. In that
case, it will report putative homologous genes that are mainly
associated with a given life-style.
Evolutionary Sequence Analysis
To obtain clusters of putative homologous gene products, we perform computational homology detection using a combination of BLAST [doi] and Transitivity Clustering [doi]. We run BLAST all-versus-all on the protein-coding genes (default E-value cutoff of 0.01) to obtain a pairwise similarity matrix. Each similarity value in this matrix is the -log10 of the best pairwise BLAST E-value. An E-value of 10^-53 for two proteins A and B consequently results in a similarity of 53 between them: similarity(A,B) = 53. Transitivity Clustering transforms this similarity matrix into a weighted similarity graph, in which genes are nodes and similarities are weighted edges. The software applies a similarity cutoff (the so-called density parameter) and removes all edges below this value. Afterwards, the potentially intransitive graph is transformed into a transitive one by adding and removing edges at minimal total edge modification cost (the Weighted Cluster Editing problem). Transitivity Clustering ensures that the average similarity between clusters is below the cutoff, while the average similarity between genes from the same cluster is above it. The methodology has proven robust for predicting clusters of homologous genes and proteins based on pairwise BLAST results.
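The conversion from E-values to similarities and the cutoff step can be sketched as follows. This is a minimal Python illustration, not code from LiSSI or Transitivity Clustering: the function names, the (gene_a, gene_b, evalue) hit format, and the cap applied to E-values reported as zero are all assumptions, and the cluster-editing step itself is not shown.

```python
import math


def evalue_to_similarity(evalue, max_similarity=300.0):
    """Return -log10(E-value); the cap for E == 0 is an assumed convention."""
    if evalue <= 0.0:
        return max_similarity
    return min(max_similarity, -math.log10(evalue))


def build_similarity(hits):
    """Keep the best (lowest) E-value per unordered gene pair.

    hits: iterable of (gene_a, gene_b, evalue) tuples (hypothetical format).
    Returns a dict mapping a sorted (gene, gene) pair to its similarity.
    """
    best = {}
    for gene_a, gene_b, evalue in hits:
        if gene_a == gene_b:
            continue  # self-hits carry no clustering information
        pair = tuple(sorted((gene_a, gene_b)))
        if pair not in best or evalue < best[pair]:
            best[pair] = evalue
    return {pair: evalue_to_similarity(e) for pair, e in best.items()}


def filter_edges(similarity, cutoff):
    """Drop all edges whose similarity lies below the cutoff."""
    return {pair: s for pair, s in similarity.items() if s >= cutoff}
```

With hits [("A", "B", 1e-53), ("A", "C", 0.5)] and a cutoff of 10, only the edge between A and B would remain, with a similarity of about 53.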
We apply Gecko [doi], a tool that identifies conserved sequences of consecutive homologous genes (islands) in a large number of genomes. The main challenge faced by the algorithm is that gene cluster conservation is usually not perfect. Therefore, the tool must accept approximate gene clusters with, for instance, gene deletions or inversions. It uses a strategy based on reference occurrences: one genome is set as the reference and approximate gene clusters are detected in all other genomes (the procedure is repeated for all analyzed genomes). It requires three parameters: the maximum distance between clusters (i.e., deletions or insertions), the minimum island size, and the minimum number of genomes in which the island is present. Furthermore, Gecko estimates the statistical significance of each island, i.e., the probability of observing a given island with the same or a higher degree of conservation in multiple genomes. It assumes as a null hypothesis that the gene order is random [doi].
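The idea of an approximate occurrence of a reference island can be illustrated with a brute-force Python sketch. It is not Gecko's algorithm or API: it assumes that genes are represented by homology cluster identifiers and that the distance is the number of genes missing from or additional to the reference set (a symmetric set difference). All names are hypothetical.

```python
def approximate_occurrences(reference_genes, genome, max_distance):
    """Find intervals of `genome` whose gene set is close to the reference set.

    reference_genes: homology cluster IDs forming the island in the reference.
    genome: list of homology cluster IDs in chromosomal order.
    Returns (start, end, distance) tuples, with `end` exclusive.
    """
    reference = set(reference_genes)
    occurrences = []
    for start in range(len(genome)):
        for end in range(start + 1, len(genome) + 1):
            window = set(genome[start:end])
            # Genes missing from the window plus extra genes in the window.
            distance = len(reference ^ window)
            if distance <= max_distance:
                occurrences.append((start, end, distance))
    return occurrences
```

Because only the set of genes in each interval is compared, gene order inside the interval does not matter, so an inverted copy of the island is still reported with distance 0. Enumerating every interval is cubic in the genome length and is only meant to make the definition concrete.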
Statistical Learning Methods
To ensure that the classification is based on the presence of a given island (or homologous
gene), we perform classification and feature selection using lifestyle-specific features
(i.e., features that are mainly present in a given lifestyle). We use the R package randomForest
to generate Random Forest (RF) classifiers from these features. RF generates many
classifiers (de-correlated trees) and aggregates their results. The final class label prediction is
based on a majority vote. Each tree is constructed using a different bootstrapped sample of the data,
and each node is split using the best predictor among a randomly chosen subset. RF is frequently used
in the life sciences because it deals comparatively well with p » n datasets (in our case: number of islands/clusters
» number of lifestyles) [doi].
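The three ingredients named above (bootstrapped samples, random predictor subsets, and majority voting) can be sketched in plain Python. This is an illustration only, not the randomForest package: tree fitting is omitted, the function names are hypothetical, and the subset size floor(sqrt(p)) is the commonly used default for classification, assumed here.

```python
import math
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; also return the unused ones."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def candidate_features(n_features, rng):
    """Pick the random subset of predictors considered at one node."""
    mtry = max(1, math.isqrt(n_features))  # assumed default: floor(sqrt(p))
    return sorted(rng.sample(range(n_features), mtry))


def majority_vote(tree_predictions):
    """Aggregate per-tree class labels; ties go to the label seen first."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For example, majority_vote(["pathogenic", "non-pathogenic", "pathogenic"]) returns "pathogenic". The out-of-bag indices are the samples a tree never saw during training, which is what makes an internal error estimate possible.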
To obtain a robust estimate of classifier quality, the data are evaluated using 5-fold cross-validation. This procedure is repeated five times (default value) using different cross-validation sets, which allows us to analyze the robustness of the classification towards changes in the homology data sets. Furthermore, we compare the resulting RF classifiers against predictions performed with randomized labels. Using exactly the same classification and cross-validation pipeline, we assign each organism a random label instead of its real class (for instance, its pathogenicity label). We expect a drastic drop in classification performance with random labels, preferably to a level close to that of a random classifier (50% accuracy in a two-class learning problem). This allows us to assess the classification robustness. For all classifiers, with real labels and with random labels, ROC (receiver operating characteristic) plots are generated to inspect their performance.
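The evaluation scheme, repeated 5-fold cross-validation plus a randomized-label control, can be sketched as follows. This Python outline is an assumption-laden illustration rather than the tool's code: the function names are hypothetical, and random labels are produced here by permuting the real ones, which keeps the class proportions unchanged.

```python
import random


def repeated_kfold(n_samples, k=5, repeats=5, seed=0):
    """Yield (train_indices, test_indices) for `repeats` shuffled k-fold runs."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a different partition in every repetition
        folds = [order[i::k] for i in range(k)]
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield train_idx, test_idx


def permuted_labels(labels, seed=0):
    """Return a shuffled copy of the labels for the random-label control."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled
```

The same classifier would be trained and scored once per split on the real labels and once on the permuted labels, so the two sets of results can be compared directly.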
Given RF’s random exploration of features, it can also be used for feature selection [doi]. The R package varSelRF is used to identify the most discriminant features for each lifestyle. The package successively eliminates the least important variables, using the so-called out-of-bag error (the RF-internal error estimate) as the minimization criterion. Afterwards, the R package rpart (Recursive Partitioning and Regression Trees) is used to generate decision trees, applying the so-called Gini Index as the maximization criterion (and standard values otherwise). The clusters of homologous genes used in the tree’s nodes are named by a simple majority vote over all gene product descriptions of the cluster.
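How a Gini-based criterion scores a candidate split on the presence or absence of one island or gene cluster can be shown with a short Python sketch. It illustrates the general formula only and does not reproduce rpart's implementation; the function names and the boolean presence vector are assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(present, labels):
    """Impurity reduction when genomes are split by presence of a feature.

    present: list of booleans, one per genome (feature present or absent).
    labels: class label of each genome; must be non-empty.
    """
    with_feature = [y for p, y in zip(present, labels) if p]
    without_feature = [y for p, y in zip(present, labels) if not p]
    n = len(labels)
    weighted = (len(with_feature) / n) * gini_impurity(with_feature) \
        + (len(without_feature) / n) * gini_impurity(without_feature)
    return gini_impurity(labels) - weighted
```

A feature present in exactly the genomes of one lifestyle in a balanced two-class data set yields the maximum gain of 0.5, whereas a feature distributed evenly across both lifestyles yields a gain of 0.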
Barbosa, Eudes, et al. "LifeStyle-Specific-Islands (LiSSI): Integrated Bioinformatics Platform for Genomic Island Analysis." Journal of Integrative Bioinformatics (2017).