######################################################################
 
README
 
######################################################################

=====================================
Usage Description
=====================================
This module identifies Orthologous Protein-coding gene Pairs (OPPs): it collects orthologous relationships among the 15 species (15-way), guided by collinear segments.
The four main steps are:
	(1) Step1: Collect the gene sets located in all n-way[2way-n] collinear segments.
	(2) Step2: geneset2genemap: For two species (with cucumber as the reference), select the genes that are located in the same collinear segment.
	(3) Step3: genemap_blast_filter: Each pair of genes located in the same segment is filtered out if the two genes share no sequence similarity (judged by BLAST E-value).
	(4) Step4: For each OPP, calculate its collinear-segment support score (OPSS: Ortholog_Protein-coding genes_Support_Score).
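
Step 1 is essentially an interval-overlap problem. The sketch below is a minimal Python illustration, not the code used by run.sh (which relies on bedtools). It assumes that a gene counts as "located in" a segment when the two intervals overlap by at least one base on the same sequence; the gene IDs, segment IDs and coordinates are made up for the example.

```python
from collections import defaultdict

def genes_in_segments(genes, segments):
    """Return sorted (gene_id, segment_id) rows for genes overlapping a segment.

    genes:    iterable of (chrom, start, end, gene_id)
    segments: iterable of (chrom, start, end, segment_id)
    Coordinates are treated as 1-based, inclusive (as in GFF3).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, seg_id in segments:
        by_chrom[chrom].append((start, end, seg_id))

    rows = set()
    for chrom, g_start, g_end, gene_id in genes:
        for s_start, s_end, seg_id in by_chrom.get(chrom, []):
            # Assumption: any overlap counts as "located in" the segment.
            if g_start <= s_end and s_start <= g_end:
                rows.add((gene_id, seg_id))
    return sorted(rows)

# Hypothetical toy data
genes = [
    ("chr1", 100, 500, "geneA"),
    ("chr1", 5000, 5600, "geneB"),
    ("chr2", 100, 300, "geneC"),
]
segments = [
    ("chr1", 50, 1000, "seg1"),
    ("chr2", 1000, 2000, "seg2"),
]

geneset = genes_in_segments(genes, segments)
for gene_id, seg_id in geneset:
    print(f"{gene_id}\t{seg_id}")
```

The two printed columns mirror the geneset.table layout (geneID, segmentID). The nested loop is fine for small inputs; real genome-scale data is better handled by bedtools.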

 
=====================================
Directory Contents
=====================================
 
This directory contains the README, an input directory, an output directory and a scripts directory.
 
 
The sections below include:
 
	scripts directory
		run.sh
		pl_scripts
			Step2.pl		
			Step3.1.pl
			Step3.2.pl
			Step4.pl
	Input directory
		blast_results directory
			ref_spec.blast 
			BBH.results 
		pro_gff_dir directory
			species_pro.gff3 
		segments_gff3 directory
			n-way directory
				{n}way_segments.gff3 
			2way-n directory
				2way-{n}_segments.gff3 
	Output directory
		geneset directory
			n-way directory
				spec_{n}way_geneset.table
			2way-n directory
				spec_2way-{n}_geneset.table
		genemap directory
			n-way directory
				ref_spec_{n}way_good.genemap
			2way-n directory
				ref_spec_2way-{n}_good.genemap
		OPPs directory
			Final.opps file
			OPPs_MAAs-based.pair file
			OPPs_proteins-based.pair file
			2way-n_all.opps file
			n-way_all.opps file
			spec_opps directory
				ref_spec_n-way(2way-n).opps
	README file
 
 
=====================================
run.sh file
=====================================
##################################################################################################################################################################################
# Prepare dataset: n-way(2way-n)_segments_gff3/* produced by collinear_detection.sh 	                			
# Description: 
#	Step 1 Collect the genesets located in all n-way[2way-n] collinear_segments.			   			   	
#	Step 2 geneset2genemap:	For two species (with cucumber as the reference), select the genes that are located in the same collinear segment.
#	Step 3 genemap_blast_filter: Each pair of genes located in the same segment is filtered out if the two genes share no sequence similarity (judged by BLAST E-value).
#	Step 4 For each OPP, calculate its collinear-segment support score (OPSS: Ortholog_Protein-coding genes_Support_Score).
#	
#	Dependency tools: bedtools2-master, pl_scripts/*.pl, BBH.pair (obtained by running blastp all_vs_all and extracting the BBH (Bidirectional Best Hits) relationships)
#	Usage: sh collinear.sh 			   																				
#	Input: n-way[2way-n]_collinear.gff3, spec_pro_gff3 including protein-coding genes annotation    			
#	Output: Final.opps 															                    			
#	Final.opps has six columns: ref_geneID, spec_geneID, M_score (MAAs-based), P_score (proteins-based), BBH_score (0: no, 2: yes) and OPSS_score (= M_score + P_score + BBH_score)
###################################################################################################################################################################################



=====================================
scripts directory
=====================================
This directory provides the four Perl scripts that run.sh depends on.

(1)	Step2.pl performs the geneset2genemap conversion required in step 2 of run.sh
#################################################################################################
#	This script can be used to prepare genemap.table file used in OPPs_identify.sh as Step2.
#	n-way[2way-n].genemap.table: 3 columns including segmentID, ref_geneID, spec_geneID.	
#	Usage: perl Step2.pl ref_geneset.table spec_geneset.table > ref_spec_genemap.table.		
#	Input: ref_geneset.table, spec_geneset.table											
#	Output: ref_spec_genemap.table															
#################################################################################################
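
The geneset2genemap step amounts to joining two geneset tables on their shared segment ID. The Python sketch below illustrates that join; it is not the Step2.pl source. It assumes each geneset row is (geneID, segmentID), as described for geneset.table, and the gene IDs are hypothetical.

```python
from collections import defaultdict

def geneset_to_genemap(ref_rows, spec_rows):
    """Pair every reference gene with every species gene in the same segment.

    ref_rows, spec_rows: iterables of (gene_id, segment_id)
    Returns a list of (segment_id, ref_gene_id, spec_gene_id).
    """
    spec_by_segment = defaultdict(list)
    for gene_id, seg_id in spec_rows:
        spec_by_segment[seg_id].append(gene_id)

    genemap = []
    for ref_gene, seg_id in ref_rows:
        for spec_gene in spec_by_segment.get(seg_id, []):
            genemap.append((seg_id, ref_gene, spec_gene))
    return genemap

# Hypothetical toy data
ref_rows = [("Csa1", "seg1"), ("Csa2", "seg2")]
spec_rows = [("Mel1", "seg1"), ("Mel2", "seg1"), ("Mel3", "seg3")]

genemap = geneset_to_genemap(ref_rows, spec_rows)
for seg_id, ref_gene, spec_gene in genemap:
    print(f"{seg_id}\t{ref_gene}\t{spec_gene}")
```

Segments present in only one of the two tables (seg2 and seg3 here) produce no rows, so only genes that share a collinear segment are paired.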

(2)	Step3.1.pl prepares the good.genemap file in step 3.1 of run.sh
#########################################################################################################
#	This script can be used to prepare good.genemap file used in OPPs_identify.sh as Step3.1.		
#	good.genemap: 3 columns including segmentID, ref_geneID, spec_geneID							
#	Usage: perl Step3.1.pl ref_spec.genemap ref_spec.blast blast_evalue_cutoff > ref_spec_good.genemap			
#	Input: (1) ref_spec.genemap; (2) ref_spec.blast; (3) blast_evalue_cutoff						
#	Output: ref_spec_good.genemap with three columns: segmentID, ref_geneID and spec_geneID
#########################################################################################################
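
The E-value filter can be sketched as follows. This is an illustration only, under two assumptions that this README does not state: that ref_spec.blast is in BLAST tabular format (outfmt 6, with the E-value in column 11), and that the query column holds the reference gene and the subject column the species gene. The gene IDs and hit values are made up.

```python
def filter_genemap(genemap, blast_lines, evalue_cutoff):
    """Keep genemap rows whose gene pair has a BLAST hit at or below the cutoff.

    genemap:     list of (segment_id, ref_gene_id, spec_gene_id)
    blast_lines: iterable of tab-separated lines (assumed outfmt 6)
    """
    supported = set()
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or header lines
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= evalue_cutoff:
            supported.add((query, subject))
    return [row for row in genemap if (row[1], row[2]) in supported]

# Hypothetical toy data
genemap = [("seg1", "Csa1", "Mel1"), ("seg1", "Csa1", "Mel2")]
blast_lines = [
    "Csa1\tMel1\t85.0\t300\t40\t2\t1\t300\t1\t298\t1e-50\t400",
    "Csa1\tMel2\t30.0\t80\t50\t5\t10\t90\t20\t100\t0.5\t25.0",
]

good_genemap = filter_genemap(genemap, blast_lines, 1e-5)
for row in good_genemap:
    print("\t".join(row))
```

With a cutoff of 1e-5, the Csa1/Mel2 pair (E-value 0.5) is dropped and only Csa1/Mel1 remains.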

(3)	Step3.2.pl collects the n-way[2way-n] information for each OPP in step 3.2 of run.sh
#########################################################################################################################
#	This script can be used to collect the nway[2way-n] information for each OPPs used in OPPs_identify.sh as Step3.2
#	Usage: perl Step3.2.pl ref_spec_n-way[2way-n]_good.genemap spec n-way[2way-n] > n-way[2way-n]_all.opps								
#	Input: [1] ref_spec_n-way[2way-n]_good.genemap;[2] spec; [3] n-way[2way-n]										
#	Output: n-way[2way-n]_all.opps																					
#	output with 5 Columns as: (1)ref_geneID;(2)spec_geneID;(3)spec;(4)n-way(2way-n);(5)segmentID					
#########################################################################################################################
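
The reshaping done at this step, from the three-column good.genemap to the five-column all.opps layout, can be sketched in Python as below. This is only an illustration of the column layout described above, not the Step3.2.pl source; the species label, way label and gene IDs are hypothetical.

```python
def annotate_genemap(good_genemap, spec, way):
    """Turn (segmentID, ref_geneID, spec_geneID) rows into
    (ref_geneID, spec_geneID, spec, way, segmentID) rows."""
    return [
        (ref_gene, spec_gene, spec, way, seg_id)
        for seg_id, ref_gene, spec_gene in good_genemap
    ]

# Hypothetical toy data
good_genemap = [("seg1", "Csa1", "Mel1"), ("seg7", "Csa9", "Mel4")]

opps = annotate_genemap(good_genemap, spec="melon", way="2way-3")
for row in opps:
    print("\t".join(row))
```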

(4)	Step4.pl calculates the OPSS for each OPP in step 4 of run.sh
#####################################################################################################################################################################################
#	This script can be used to calculate OPSS (Orthology_Protein-coding genes_Support_Score) for each OPPs used in OPPs_identify.sh as Step4										
#	Usage: perl Step4.pl OPPs_MAAs-based.pair OPPs_proteins-based.pair BBHs.pair > Final.opps																					
#	Input: 
#		(1) OPPs_MAAs-based.pair: the OPPs results obtained by using MAAs as genomic markers as above steps																	
#		(2) OPPs_proteins-based.pair: the OPPs results obtained by using protein-coding genes as genomic markers to perform above all steps										
#		(3) BBHs.pair: BBH.results were obtained by blastp all_vs_all and extract BBH (Bidirectional Best Hits) relationships with BBH_extract.pl								
#		(4) p_weight: indicating a [0,1] number as weight of score obtained by protein-based method																				
#	Output: Final.opps																																							
#	Final.opps: six columns: ref_geneID, spec_geneID, M_score (MAAs-based), P_score (proteins-based), BBH_score (0: no, 2: yes) and OPSS_score (= M_score + P_score + BBH_score)
#####################################################################################################################################################################################
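
The OPSS formula above (OPSS_score = M_score + P_score + BBH_score, with BBH_score being 0 or 2) can be illustrated with the Python sketch below. It is not the Step4.pl source. How M_score and P_score are derived from the .pair files is not described here, so the sketch takes them as precomputed per-pair numbers. It also assumes that a pair missing from one method scores 0 for that method, and it omits p_weight because the formula above does not include it. All IDs and scores are made up.

```python
def final_opps(m_scores, p_scores, bbh_pairs):
    """Combine per-pair scores into Final.opps-style rows.

    m_scores, p_scores: dicts mapping (ref_gene_id, spec_gene_id) to a score
    bbh_pairs:          set of (ref_gene_id, spec_gene_id) that are BBHs
    Returns rows of (ref, spec, M_score, P_score, BBH_score, OPSS_score).
    """
    rows = []
    for pair in sorted(set(m_scores) | set(p_scores)):
        m_score = m_scores.get(pair, 0)  # assumption: missing pair scores 0
        p_score = p_scores.get(pair, 0)  # assumption: missing pair scores 0
        bbh_score = 2 if pair in bbh_pairs else 0
        opss = m_score + p_score + bbh_score
        rows.append((pair[0], pair[1], m_score, p_score, bbh_score, opss))
    return rows

# Hypothetical toy data
m_scores = {("Csa1", "Mel1"): 3, ("Csa2", "Mel2"): 1}
p_scores = {("Csa1", "Mel1"): 2}
bbh_pairs = {("Csa1", "Mel1")}

final_rows = final_opps(m_scores, p_scores, bbh_pairs)
for row in final_rows:
    print("\t".join(str(value) for value in row))
```

In this toy input, Csa1/Mel1 is supported by both methods and is a BBH, so it scores 3 + 2 + 2 = 7; Csa2/Mel2 is supported only by the MAAs-based method and scores 1.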


=====================================
Input directory
=====================================
This directory provides:
(1) blast_results obtained by blast_all_vs_all. 
(2) segments_gff3 produced by collinear_detection.sh.
(3) pro_gff3 (indicating the protein-coding genes annotation information) extracted from genome annotation files. 
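
The BBH.results file is described as the Bidirectional Best Hits extracted from an all-vs-all blastp run. The Python sketch below shows the general BBH idea only; it is not BBH_extract.pl. It assumes that "best" means the highest bit score per query and that self-hits are ignored; the hit tuples are hypothetical.

```python
def best_hits(hits):
    """Map each query to its best-scoring subject (highest bit score)."""
    best = {}
    for query, subject, bitscore in hits:
        if query == subject:
            continue  # ignore self-hits
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(hits):
    """Return sorted gene pairs that are each other's best hit."""
    best = best_hits(hits)
    pairs = set()
    for query, subject in best.items():
        if best.get(subject) == query:
            pairs.add(tuple(sorted((query, subject))))
    return sorted(pairs)

# Hypothetical toy data: (query, subject, bit score)
hits = [
    ("Csa1", "Mel1", 200.0),
    ("Csa1", "Mel2", 150.0),
    ("Mel1", "Csa1", 198.0),
    ("Mel2", "Csa1", 140.0),
    ("Csa2", "Mel2", 90.0),
]

bbh = bidirectional_best_hits(hits)
for gene_a, gene_b in bbh:
    print(f"{gene_a}\t{gene_b}")
```

Mel2's best hit is Csa1, but Csa1's best hit is Mel1, so only the Csa1/Mel1 pair is reciprocal. A real all-vs-all run would also need to restrict hits to the two species being compared, which this sketch leaves out.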


=====================================
Output directory
=====================================
This directory provides: 
(1) geneset files named spec_n-way[2way-n]_geneset.table, with two columns: geneID and multipliconID (segmentID).
(2) genemap files named ref_spec_n-way[2way-n]_good.genemap, with 3 columns: (1)segmentID;(2)ref_geneID;(3)spec_geneID.
(3) OPPs directory provides:
	[3.1] Final.opps, with 6 columns: ref_geneID, spec_geneID, M_score, P_score, BBH_score and OPSS_score.
	[3.2] n-way(2way-n)_all.opps contains the OPPs guided by the two levels of collinear segments, with 5 columns: ref_geneID, spec_geneID, spec, way and segmentID.
	[3.3] OPPs_MAAs-based.pair contains the total OPPs results, with 5 columns: ref_geneID, spec_geneID, spec, way and segmentID.
	[3.4] OPPs_proteins-based.pair contains the OPPs obtained by using protein-coding genes as genomic markers instead of MAAs and running the same pipeline.
	[3.5] spec_opps contains the OPPs between cucumber and each of the other 14 species at the different collinear levels (n-way and 2way-n), with 5 columns: ref_geneID, spec_geneID, spec, way and segmentID.

Collinear segment files come in two types (table and gff3), and the segments at the two levels (n-way and 2way-n) are kept in separate directories.

=====================================
README file
=====================================
It is this file.

 



