Description
This software provides a hybrid GO-based semantic similarity algorithm for evaluating the functional similarity between GO terms or gene products. The software uses pre-downloaded GO database files and GO annotation files. It lets users choose the organism and the evidence codes to ignore. The software is composed of five modules: getGODAG, getGOAnno, hrssmatrix, hrsstps and hrsspps.
M1 getGODAG: it creates a GO database in a given MySQL database. The GO DAG contains three tables, 'term' and 'term2term' directly from pre-downloaded files, and 'GOPaths_wholeDAG'.
M2 getGOAnno: given an organism or multi-organism (e.g. UniProtKB) GOA database, it parses the GO annotation file.
M3 hrssmatrix: given the same organism or multi-organism (e.g. UniProtKB) GOA database as in M2, it calculates the HRSS matrix for all term pairs in a given DAG or in all three DAGs.
M4 hrsstps: given a GO ontology, it returns HRSS values for the term pairs in an input file.
M5 hrsspps: given a GO ontology, it returns functional similarity for the protein pairs in an input file.
Requirement
HRSS runs on Linux and requires a running MySQL server on the system.
If the 'LOAD DATA LOCAL INFILE' statement is disabled, re-enable it on the MySQL server side by editing the my.cnf file (find its full path with 'whereis my.cnf').
Under both the [mysqld] and [mysql] sections, add the line 'local-infile=1'.
Then restart the MySQL server with root permission.
See http://dev.mysql.com/doc/refman/5.5/en/load-data-local.html for details.
Second, log in to MySQL as a user with the CREATE privilege, create a database with the chosen name ('GOyeast', for example), and grant privileges on it to the user (e.g., name 'user' and password 'pwd') who will run the software.
For detailed usage, see the MySQL manual: http://dev.mysql.com/doc/#manual.
Installation
The HRSS download page can be accessed here.
HRSS currently runs on Linux. Simply place the downloaded HRSS_version.tar.gz in any directory and extract it.
The package contains three folders: bin, data and results.
> Folder bin: contains (1) source files in the C programming language, (2) a 'Makefile' to compile the program and (3) Perl script files that are called by the hrss program. Compile the program on your platform with the provided Makefile.
The compiled programs are placed in the same directory as the source. Add this directory to the PATH environment variable by editing a file such as ~/.bash_profile (or /etc/bashrc for root).
> Folder data: contains scripts for running the program and example input files.
> Folder results: contains the result files produced by running the scripts in the data folder.
Modules
The software has five modules: getGODAG, getGOAnno, hrssmatrix, hrsstps and hrsspps. The example scripts are in the folder 'data'.
__________________________________________________________________
M1: getGODAG -- it creates a GO database in a given MySQL database. The GO DAG contains three tables, 'term' and 'term2term' directly from pre-downloaded files, and 'GOPaths_wholeDAG'.
Inputs:
1. infile_of_mysql_info: input file containing the mysql information. This file is required by all modules.
2. directory_of_GO_database : the directory of downloaded GO database files. Only four files, namely term.txt, term.sql, term2term.txt and term2term.sql are used in the program.
Usage:
Example:
Outputs:
MySQL tables 'term' and 'term2term' created directly from pre-downloaded GO database files.
MySQL table 'GOPaths_wholeDAG' containing all possible paths between two terms in any GO DAG.
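To illustrate what "all possible paths between two terms" means, here is a minimal sketch that enumerates every path from a term up to one of its ancestors in a toy DAG. The `parents` dictionary, the function name and the term labels are hypothetical; this is not the code used by getGODAG, and it says nothing about the actual column layout of the table.

```python
def all_paths(parents, start, target):
    """Return every path from `start` up to `target` following parent links."""
    if start == target:
        return [[start]]
    paths = []
    for parent in parents.get(start, []):
        # Extend each path found from the parent with the current term.
        for tail in all_paths(parents, parent, target):
            paths.append([start] + tail)
    return paths


# Hypothetical toy DAG (child -> list of parents); not real GO terms.
toy_parents = {
    "D": ["B", "C"],
    "B": ["A"],
    "C": ["A"],
}

paths_d_to_a = all_paths(toy_parents, "D", "A")
for path in paths_d_to_a:
    print(" -> ".join(path))
```

Because a term in a DAG can have several parents, two terms may be connected by more than one path, which is why such a table can hold several rows per term pair.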
__________________________________________________________________
M2: getGOAnno -- it parses the GO annotation file.
Inputs:
1. infile_of_mysql_info: input file containing the mysql information.
2. corpus: indicating an organism name or a multi-organism GOA database.
3. infile_of_GOA : the GO annotation (GOA) file for a corpus.
Usage:
Example:
Outputs:
MySQL table 'corpus_go' containing the GO annotation information. If the corpus is set as yeast, the table name is 'yeast_go'.
Structure of MySQL tables 'corpus_GOassos_allevi' (for the case where no evidence code is filtered out, i.e. 'none') and 'corpus_GOassos_filevi' (for the case where evidence code(s) are filtered out later in M3).
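As a rough illustration of the information a GO annotation file carries, the sketch below reads one line, assuming the file follows the standard tab-delimited GAF layout (column 2 is DB_Object_ID, column 5 the GO ID, column 7 the evidence code, column 9 the aspect, and comment lines start with '!'). The function name and the sample record are made up for the example; this is not how getGOAnno itself is implemented.

```python
def parse_gaf_line(line):
    """Parse one GAF line; return None for comments, blanks or short lines."""
    if not line.strip() or line.startswith("!"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return None
    return {
        "db_object_id": cols[1],  # column 2: DB_Object_ID
        "go_id": cols[4],         # column 5: GO ID
        "evidence": cols[6],      # column 7: evidence code
        "aspect": cols[8],        # column 9: P, F or C
    }


# Made-up record for illustration only (not taken from a real GOA file).
sample = "\t".join(
    ["DB", "OBJ0001", "SYM1", "", "GO:0000001", "REF:0", "IEA", "", "P"]
)
record = parse_gaf_line(sample)
print(record)
```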
__________________________________________________________________
M3: hrssmatrix -- given an organism or multi-organism (such as UniProt) GOA database, it does all the steps necessary for calculating semantic similarity scores of all term pairs.
Inputs:
1. infile_of_mysql_info: input file containing the mysql information.
2. corpus: indicating an organism name or a multi-organism GOA database.
3. evidence_codes_ignored: a comma-delimited string of evidence code(s) to be filtered out, e.g. 'IEA' and 'IEA,IKR'. If no code is ignored, set the parameter to 'none'.
4. ontology_alldags: the GO ontology to be considered, chosen from:
+ alldags: all three ontologies will be considered separately.
+ BP: only biological process
+ CC: only cellular component
+ MF: only molecular function
Note: separate runs for the BP, CC and MF ontologies can be executed simultaneously.
5. directory_for_output: directory for output files.
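The evidence_codes_ignored convention described above ('none', a single code, or a comma-delimited list) can be sketched as follows. Both helper names are hypothetical, and the snippet only illustrates the string format, not how hrssmatrix actually filters annotations.

```python
def parse_ignored_codes(value):
    """Turn 'none', 'IEA' or 'IEA,IKR' into a set of evidence codes."""
    if value.strip() == "none":
        return set()
    return {code.strip() for code in value.split(",") if code.strip()}


def keep_annotation(evidence_code, ignored):
    """An annotation is kept unless its evidence code is in the ignored set."""
    return evidence_code not in ignored


ignored = parse_ignored_codes("IEA,IKR")
print(sorted(ignored))
print(keep_annotation("IDA", ignored), keep_annotation("IEA", ignored))
```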
Usage:
Example:
Outputs:
Updated MySQL table 'corpus_GOassos_allevi' (when the parameter is 'none', i.e. no evidence code is filtered out) OR 'corpus_GOassos_filevi' (when one or more evidence codes are filtered out).
Under the output directory, the HRSS matrix file of all term pairs (excluding the root term) in a DAG is saved, e.g., yeast_mx_allevi.MF is the matrix for yeast on the MF ontology including all evidence codes. If 'alldags' is set, three such files are produced, one per ontology.
__________________________________________________________________
M4: hrsstps -- it returns the HRSS values for input term pairs in a given GO ontology.
Inputs:
1. infile_of_mysql_info: input file containing the mysql information.
2. corpus: indicating an organism name or a multi-organism GOA database.
3. evidence_codes_ignored: a comma-delimited string of evidence code(s) to be filtered out, e.g. 'IEA' and 'IEA,IKR'. If no code is ignored, set the parameter to 'none'.
4. ontology: the GO ontology to be considered, chosen from {BP, CC, MF}
5. directory_for_matrixfile: directory for pre-computed HRSS matrix file
6. infile_of_termpairs: input file containing tab-delimited term pairs. Only term accession (e.g. GO:0000001) is recognized in the software.
7. outfile_of_termpairs: name of output file
Usage:
Example:
Outputs:
Output file named by parameter outfile_of_termpairs, with tab-delimited string of term1_acc, term2_acc, term1_id, term2_id and HRSS in each line.
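A downstream script could read this five-column output along the lines of the sketch below. The record type, the function name and the sample line are hypothetical (the values are invented, not produced by hrsstps), and the term ids are kept as strings because their exact type is not stated here.

```python
from collections import namedtuple

# Hypothetical record type mirroring the five tab-delimited output columns.
TermPairScore = namedtuple(
    "TermPairScore", ["term1_acc", "term2_acc", "term1_id", "term2_id", "hrss"]
)


def parse_termpair_line(line):
    """Split one output line into its five fields; HRSS becomes a float."""
    acc1, acc2, id1, id2, score = line.rstrip("\n").split("\t")
    return TermPairScore(acc1, acc2, id1, id2, float(score))


# Invented example line for illustration only.
example = "GO:0000001\tGO:0000002\t1\t2\t0.5\n"
row = parse_termpair_line(example)
print(row.term1_acc, row.term2_acc, row.hrss)
```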
__________________________________________________________________
M5: hrsspps -- it returns the HRSS values for input protein pairs in a GO ontology.
Inputs:
1. infile_of_mysql_info: input file containing the mysql information.
2. corpus: indicating an organism name or a multi-organism GOA database.
3. evidence_codes_ignored: a comma-delimited string of evidence code(s) to be filtered out, e.g. 'IEA' and 'IEA,IKR'. If no code is ignored, set the parameter to 'none'.
4. pairwise: chosen from {max, bma}. max: maximum approach, bma: best-match average approach.
5. ontology: GO ontology to be considered, chosen from {BP, CC, MF}. Note: separate runs for BP, CC or MF can be executed simultaneously.
6. directory_for_matrixfile: directory for pre-computed HRSS matrix file
7. infile_of_proteinpairs: input file containing tab-delimited protein pairs. Only the 'DB_Object_ID' in the downloaded GO annotation file is recognized in the software.
8. outfile_of_proteinpairs: name of output file
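To make the two pairwise options concrete, the following sketch combines a matrix of term-to-term similarities into one protein-level score. It assumes rows correspond to the terms of the first protein and columns to the terms of the second. The best-match average has several variants in the literature; the one shown (sum of row maxima plus sum of column maxima, divided by the total number of terms) is only one common formulation and may differ from what hrsspps computes. The function names and matrix values are invented.

```python
def max_score(sim):
    """Maximum approach: the single best term-pair similarity."""
    return max(max(row) for row in sim)


def bma_score(sim):
    """Best-match average: average of each term's best match on the other side."""
    row_best = [max(row) for row in sim]        # best match for each row term
    col_best = [max(col) for col in zip(*sim)]  # best match for each column term
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


# Invented similarities: 3 terms for the first protein, 2 for the second.
toy_sim = [
    [0.9, 0.2],
    [0.1, 0.4],
    [0.3, 0.5],
]

print(max_score(toy_sim))
print(bma_score(toy_sim))
```

The maximum approach rewards a single strongly matching term pair, whereas the best-match average also accounts for terms that have no good counterpart in the other protein.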
Usage:
Example:
Outputs:
Output file named by parameter outfile_of_proteinpairs, with tab-delimited string of protein1, protein2 and HRSS in each line.
Note:
The HRSS values of two proteins both with valid GO annotations range from 0 to 1. There are three negative values for a protein pair, -2, -3 and -4, indicating one or both proteins have no GO annotation.
Given the protein pair P and Q, -2 means Q has no GO annotation, -3 means P has no annotation and -4 means both P and Q have no annotation.
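When post-processing the output file, the negative codes should be separated from real similarity values before any averaging or ranking. The sketch below does this, assuming P is the first column and Q the second; the function name, the status labels and the protein identifiers are hypothetical.

```python
# Meaning of the negative codes, as described in the note above.
NO_ANNOTATION = {
    -2: "second protein (Q) has no GO annotation",
    -3: "first protein (P) has no GO annotation",
    -4: "neither protein has a GO annotation",
}


def interpret_line(line):
    """Return (protein1, protein2, score_or_None, message) for one output line."""
    p, q, raw = line.rstrip("\n").split("\t")
    value = float(raw)
    if value in NO_ANNOTATION:
        return p, q, None, NO_ANNOTATION[value]
    if 0.0 <= value <= 1.0:
        return p, q, value, "valid HRSS value"
    return p, q, None, "unexpected value"


# Invented lines for illustration only.
print(interpret_line("PROT_A\tPROT_B\t0.75\n"))
print(interpret_line("PROT_A\tPROT_C\t-2\n"))
```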
Walkthrough examples
Two script files are in the folder 'data': one considers all evidence codes, and the other excludes IEA (Inferred from Electronic Annotation) codes. The step-by-step example below considers all annotations. All input files in the example are in the folder 'data'. Run the commands from the 'data' directory.
Step 1. run M1 (getGODAG) to create a MySQL database of the GO DAG.
Step 2. run M2 (getGOAnno) to parse the pre-downloaded GO annotation file of yeast.
Step 3. run M3 (hrssmatrix) to calculate HRSS values for all term pairs in three ontologies when all evidence codes are considered.
Then, users can fetch HRSS values for the term pairs or protein pairs of interest.
Step 4. run M4 (hrsstps) to fetch HRSS values (yeast annotation on BP including all annotations) for the term pairs in an input file.
Step 5. run M5 (hrsspps) to fetch HRSS (MAX) values (yeast annotation on BP including all annotations) for the protein pairs in an input file.