bioinformatics_tools3

Parsing the InterProScan results

InterProScan is a powerful tool for identifying protein domains and families. However, its output is not easy to handle: it has many columns, and the same ID is repeated on many rows. Here we use a script, interproscan_to_one_line.py, to merge all rows with the same gene ID into one line. Compare the output before and after running the script on the TAIR10 data. The Homo_sapiens data is at the bottom of the page.

Before:

AT1G11750.1	e35d0d54d5d209033484e31afc0ff009	271	Coils	Coil	Coil	210	230	-	T	27-09-2021	-	-
AT1G11750.1	e35d0d54d5d209033484e31afc0ff009	271	Hamap	MF_00444	ATP-dependent Clp protease proteolytic subunit [clpP].	84	270	33.141739	T	27-09-2021	IPR001907	ATP-dependent Clp protease proteolytic subunit	GO:0004176|GO:0004252|GO:0006508	-
AT1G11750.1	e35d0d54d5d209033484e31afc0ff009	271	CDD	cd07017	S14_ClpP_2	96	264	3.96818E-88	T	27-09-2021	IPR001907	ATP-dependent Clp protease proteolytic subunit	GO:0004176|GO:0004252|GO:0006508	-
AT1G11750.1	e35d0d54d5d209033484e31afc0ff009	271	ProSitePatterns	PS00382	Endopeptidase Clp histidine active site.	189	202	-	T	27-09-2021	IPR033135	ClpP, histidine active site	-	-
AT1G11750.1	e35d0d54d5d209033484e31afc0ff009	271	Pfam	PF00574	Clp protease	96	269	3.7E-65	T	27-09-2021	IPR023562	Clp protease proteolytic subunit /Translocation-enhancing protein TepA	-	MetaCyc: PWY-7884

After:

AT1G11750.1	PF00574	Clp protease
AT5G53350.1	PF10431,PF07724	C-terminal, D2-small domain, of ClpB protein,AAA domain (Cdc48 subfamily)
AT4G17040.1	PF00574	Clp protease
AT5G15450.1	PF10431,PF17871,PF02861,PF02861,PF07724,PF00004	C-terminal, D2-small domain, of ClpB protein,AAA lid domain,Clp amino terminal domain, pathogenicity island component,Clp amino terminal domain, pathogenicity island component,AAA domain (Cdc48 subfamily),ATPase family associated with various cellular activities (AAA)
AT5G51070.1	PF07724,PF10431,PF02861,PF02861,PF00004,PF17871	AAA domain (Cdc48 subfamily),C-terminal, D2-small domain, of ClpB protein,Clp amino terminal domain, pathogenicity island component,Clp amino terminal domain, pathogenicity island component,ATPase family associated with various cellular activities (AAA),AAA lid domain

The output above shows only the rows whose analysis is “Pfam”. Users can change the parameter Pfam to another analysis, e.g., CDD or Hamap, to keep those rows and filter out the others.

  • Step One: Run InterProScan on the protein FASTA file you are interested in.
./interproscan.sh -i proteins_of_your_genome.fasta -f tsv -dp

If you are new to InterProScan, please find detailed usage in Step #6 of Zhang et al. 2021 (https://doi.org/10.1016/j.xpro.2021.100619).

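  • Step Two: Run the script on the InterProScan TSV output. The arguments are the input file, the output file, and the analysis to keep (here Pfam).
python3 interproscan_to_one_line.py test_data.tsv out.txt Pfam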
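
For readers who want to see the idea behind the merge, here is a minimal Python sketch. It is not the source of interproscan_to_one_line.py; the function name merge_interproscan_rows is hypothetical. It assumes the standard InterProScan TSV layout (column 1 is the protein ID, column 4 the analysis, column 5 the signature accession, column 6 the signature description) and keeps IDs and hits in input order, repeated hits included.

```python
from collections import defaultdict


def merge_interproscan_rows(lines, analysis="Pfam"):
    """Collapse InterProScan TSV rows into one line per protein ID.

    Hypothetical helper for illustration only. Rows whose analysis
    column (column 4) differs from `analysis` are skipped.
    """
    merged = defaultdict(lambda: ([], []))  # ID -> (accessions, descriptions)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[3] != analysis:
            continue  # malformed row, or a different analysis
        accessions, descriptions = merged[fields[0]]
        accessions.append(fields[4])
        descriptions.append(fields[5])
    # dicts keep insertion order, so IDs come out in first-seen order
    return [
        "\t".join([protein_id, ",".join(acc), ",".join(desc)])
        for protein_id, (acc, desc) in merged.items()
    ]


if __name__ == "__main__":
    demo = [
        "AT1G11750.1\tmd5\t271\tCoils\tCoil\tCoil\t210\t230\t-\tT\t27-09-2021\t-\t-",
        "AT1G11750.1\tmd5\t271\tPfam\tPF00574\tClp protease\t96\t269\t3.7E-65\tT\t27-09-2021\tIPR023562\t-",
    ]
    for merged_line in merge_interproscan_rows(demo, "Pfam"):
        print(merged_line)
```

Passing "CDD" or "Hamap" as the second argument keeps those rows instead of the Pfam ones. Descriptions are joined with commas, as in the "After" example, so a description that itself contains a comma cannot be split back unambiguously.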

InterProScan results for human

Homo_sapiens: GCF_000001405.39_GRCh38.p13_protein.faa

1. This protein set includes proteins translated from alternatively spliced transcripts (e.g., XX.t1, XX.t2, XX.t3). Use the script isoform2one (https://github.com/zx0223winner/isoform2one) to select the primary transcript protein sequence (longest transcript).

2. Then run the InterProScan analysis. The results are documented here: /misc/scratch2/xizhang/HSDatabase/Results/Interproscan_
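
The "longest transcript" selection in item 1 can be sketched roughly as follows. This is not the isoform2one code, and that tool may group isoforms differently (for example, from an annotation file). The sketch assumes isoform IDs follow the XX.t1, XX.t2 pattern mentioned above, so the gene name is the ID with the ".tN" suffix removed; both function names are hypothetical.

```python
import re


def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_isoform_per_gene(records, suffix=r"\.t\d+$"):
    """Keep the longest protein per gene.

    Assumption: stripping `suffix` from an isoform ID gives the gene name.
    When two isoforms have the same length, the first one seen is kept.
    """
    best = {}
    for name, seq in records:
        gene = re.sub(suffix, "", name)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (name, seq)
    return best


if __name__ == "__main__":
    demo = [">XX.t1", "MKV", ">XX.t2", "MKVLA", "GG", ">YY.t1", "MA"]
    for gene, (name, seq) in longest_isoform_per_gene(read_fasta(demo)).items():
        print(f">{name}\n{seq}")
```

IDs that do not match the suffix pattern are treated as their own gene, so a file without isoform suffixes passes through unchanged.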

<Last updated by Xi Zhang on Oct 8th,2021>
