Domains are the building blocks of all globular proteins and are one of the most useful levels at which to understand protein function. However, the multidomain nature of many proteins confounds approaches that rely on sequence similarity measures. Proteins should be compared or classified into a single family only if they exhibit highly similar domain architectures. In this project, we use the Pfam database (Bateman et al., 2002; Bateman et al., 2004) as the source of protein domain definitions. By combining three indices, the Jaccard index, the Goodman-Kruskal γ function, and the domain duplicate index, into a single similarity measure, we propose a novel method for comparing a set of domain architectures, or proteins with multiple domains, at the level of whole domain architectures. The method was evaluated using the KOGs database (Tatusov et al., 2000; Tatusov et al., 2003), and the results show that it is suitable for comparing proteins with complex domain architectures (Lin et al., submitted).
- Pfam version 22.0, consisting of 9318 families.
- Statistically, 73.23% of all proteins in Pfamseq 22 contain a match to at least one Pfam entry. For more details, please refer to Pfam.
- Similarity metric
- Jaccard index: measuring how many common domains two proteins have.
- Goodman-Kruskal γ function: estimating the arrangement similarity of distinct domains shared by two proteins.
- Domain duplicate index: assessing how similar the duplication of individual domains is between two proteins.
- Distance matrix & clustering approach. To compare a list of domain architectures or proteins by their domain architecture similarity, the distance-based neighbor-joining algorithm (Saitou and Nei, 1987), as implemented in PHYLIP (Felsenstein, 2004), is applied to group those domain architectures or proteins into clusters. Other approaches can also be used, such as UPGMA (Sokal and Sneath, 1973) or MCL, the Markov cluster algorithm (Dongen, 1998).
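The three indices and the distance matrix described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the exact formulas are not given here, so the definition of the duplicate index (mean ratio of smaller to larger copy number over shared domains), the use of first occurrences to determine domain order for γ, the rescaling of γ to [0, 1], the unweighted mean used to combine the indices, and the conversion `distance = 1 - similarity` are all assumptions. The protein names and architectures are made-up examples, and `to_phylip` is a hypothetical helper that writes a PHYLIP-style square distance matrix.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Shared distinct domains divided by all distinct domains."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def goodman_kruskal_gamma(a, b):
    """Order agreement of shared distinct domains, in [-1, 1].

    Assumption: a domain's position is its first occurrence.
    """
    shared = set(a) & set(b)
    pos_a = {d: a.index(d) for d in shared}
    pos_b = {d: b.index(d) for d in shared}
    concordant = discordant = 0
    for x, y in combinations(sorted(shared), 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total = concordant + discordant
    # Assumption: with fewer than two shared domains no pair can disagree.
    return (concordant - discordant) / total if total else 1.0


def duplicate_index(a, b):
    """Assumed definition: mean min/max copy-number ratio of shared domains."""
    ca, cb = Counter(a), Counter(b)
    shared = set(ca) & set(cb)
    if not shared:
        return 0.0
    return sum(min(ca[d], cb[d]) / max(ca[d], cb[d]) for d in shared) / len(shared)


def similarity(a, b):
    """Assumed combination: unweighted mean of the three indices in [0, 1]."""
    if not set(a) & set(b):
        return 0.0  # nothing in common, so order and duplication are moot
    gamma01 = (goodman_kruskal_gamma(a, b) + 1) / 2  # rescale [-1, 1] to [0, 1]
    return (jaccard(a, b) + gamma01 + duplicate_index(a, b)) / 3


def distance_matrix(archs):
    """Pairwise distances, assumed to be 1 - similarity."""
    names = list(archs)
    rows = [[1.0 - similarity(archs[p], archs[q]) for q in names] for p in names]
    return names, rows


def to_phylip(names, matrix):
    """Format a square distance matrix with 10-character taxon names."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        lines.append(f"{name[:10]:<10}" + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines)


# Illustrative architectures (N- to C-terminal order), not real annotations.
archs = {
    "protA": ["SH3", "SH2", "Pkinase"],
    "protB": ["SH3", "SH2", "SH2", "Pkinase"],
    "protC": ["Pkinase", "SH2"],
}
names, dist = distance_matrix(archs)
print(to_phylip(names, dist))
```

The printed matrix could then be given to a distance-based clustering program such as the neighbor-joining implementation in PHYLIP. The input format should be checked against the PHYLIP documentation before relying on it.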
- Bateman, A., et al. (2002) The Pfam protein families database. Nucleic Acids Res, 30, 276-280.
- Bateman, A., et al. (2004) The Pfam protein families database. Nucleic Acids Res, 32, D138-141.
- Dongen, S. v. (1998) A new cluster algorithm for graphs. Centrum voor Wiskunde en Informatica (CWI).
- Felsenstein, J. (2004) Phylogeny Inference Package.
- Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 4, 406-425.
- Sokal, R. and Sneath, P. (1973) Numerical Taxonomy. San Francisco, Freeman.
- Tatusov, R. L., et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.
- Tatusov, R. L., et al. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res, 28, 33-36.