awk_for_tabulated_files


The awk command is very useful for editing or parsing tab-delimited files (for example, blast output in tab-separated columns, produced with -outfmt 6, or gff files).

awk reads a file line by line, where a line ends with a newline character (\n), and splits each line into fields. By default, fields are separated by any run of spaces or tabs; another field separator, such as a single tabulation “\t”, can be set with the -F option.
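The effect of the field separator can be checked by counting fields with the built-in variable NF. The sketch below uses a made-up line (not taken from blast.output) whose first field contains a space; the variable names are only illustrative.

```shell
# Hypothetical line: 3 tab-separated fields, the first one containing a space
line=$(printf 'gene 1\tSeq_26\t45.45')

# Default separator: any run of spaces or tabs splits the line
default_nf=$(printf '%s\n' "$line" | awk '{print NF}')

# Tab only: the space inside "gene 1" no longer splits the field
tab_nf=$(printf '%s\n' "$line" | awk -F "\t" '{print NF}')

echo "default: $default_nf fields, tab only: $tab_nf fields"
```

For files where a field may contain spaces, setting -F "\t" explicitly is therefore the safer choice.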

We will see how to use awk on a blast output file (-outfmt 6) named blast.output, whose first lines look like this:

user$ head -7 blast.output
BUSSELTON_g28320.t1	Seq_26_pilon_pilon	45.45	66	34	2	27	92	266496	266305	2e-07	57.4
BUSSELTON_g29060.t1	Seq_133_pilon_pilon	24.01	279	171	9	398	668	26316	27053	6e-13	74.7
BUSSELTON_g29060.t1	Seq_35_pilon_pilon	32.67	150	83	6	531	678	24051	23650	1e-07	57.4
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	46.67	195	79	1	1103	1272	499049	499633	9e-49	 193
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	68.89	90	27	1	594	683	498684	498950	8e-44	 137
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	77.14	35	8	0	684	718	498950	499054	8e-44	62.0
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	37.75	151	93	1	1381	1531	499664	500113	6e-23	 108
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score

How to invert 2 columns (fields)

ex: inverting the query (column 1) and the target (column 2)

user$ awk -F "\t" '{print $2"\t"$1"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12}' blast.output
Seq_26_pilon_pilon	BUSSELTON_g28320.t1	45.45	66	34	2	27	92	266496	266305	2e-07	57.4
Seq_133_pilon_pilon	BUSSELTON_g29060.t1	24.01	279	171	9	398	668	26316	27053	6e-13	74.7
Seq_35_pilon_pilon	BUSSELTON_g29060.t1	32.67	150	83	6	531	678	24051	23650	1e-07	57.4
Seq_17_pilon_pilon	BUSSELTON_g29223.t1	46.67	195	79	1	1103	1272	499049	499633	9e-49	 193
Seq_17_pilon_pilon	BUSSELTON_g29223.t1	68.89	90	27	1	594	683	498684	498950	8e-44	 137
Seq_17_pilon_pilon	BUSSELTON_g29223.t1	77.14	35	8	0	684	718	498950	499054	8e-44	62.0
Seq_17_pilon_pilon	BUSSELTON_g29223.t1	37.75	151	93	1	1381	1531	499664	500113	6e-23	 108
# -F "\t" tells awk that the fields in the blast.output file are separated by a tabulation
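Listing all twelve fields becomes tedious for wider files. One possible alternative, shown here as a sketch, is to set both the input separator (FS) and the output separator (OFS) to a tab and swap the two fields through a temporary variable. The file sample.tsv and the names tmp and swapped are made up for this example; the two rows are copied from the output above.

```shell
# Small sample built from the first two rows shown above
printf '%s\n' \
  'BUSSELTON_g28320.t1 Seq_26_pilon_pilon 45.45 66 34 2 27 92 266496 266305 2e-07 57.4' \
  'BUSSELTON_g29060.t1 Seq_133_pilon_pilon 24.01 279 171 9 398 668 26316 27053 6e-13 74.7' \
  | tr ' ' '\t' > sample.tsv

# Assigning to a field makes awk rebuild the line using OFS,
# so the remaining columns do not need to be listed one by one
swapped=$(awk 'BEGIN {FS = OFS = "\t"} {tmp = $1; $1 = $2; $2 = tmp; print}' sample.tsv)
printf '%s\n' "$swapped"
```

The output should match what the longer command prints for the same two lines.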

How to use the if statement

ex 1: printing a line if the name of the query (first column) contains “g29”

user$ awk -F "\t" '{if ($1 ~ /g29/) print}' blast.output
BUSSELTON_g29060.t1	Seq_133_pilon_pilon	24.01	279	171	9	398	668	26316	27053	6e-13	74.7
BUSSELTON_g29060.t1	Seq_35_pilon_pilon	32.67	150	83	6	531	678	24051	23650	1e-07	57.4
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	46.67	195	79	1	1103	1272	499049	499633	9e-49	 193
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	68.89	90	27	1	594	683	498684	498950	8e-44	 137
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	77.14	35	8	0	684	718	498950	499054	8e-44	62.0
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	37.75	151	93	1	1381	1531	499664	500113	6e-23	 108
# the query in the first line, BUSSELTON_g28320.t1, is the only one that does not contain "g29", so it was not printed
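The same filter can also be written without an explicit if: in awk, a condition placed in front of the braces acts as a test, and when no action is given the matching line is printed. A minimal sketch, using a made-up file sample.tsv holding three of the rows above:

```shell
# Three rows copied from blast.output (only the first one lacks "g29")
printf '%s\n' \
  'BUSSELTON_g28320.t1 Seq_26_pilon_pilon 45.45 66 34 2 27 92 266496 266305 2e-07 57.4' \
  'BUSSELTON_g29060.t1 Seq_133_pilon_pilon 24.01 279 171 9 398 668 26316 27053 6e-13 74.7' \
  'BUSSELTON_g29223.t1 Seq_17_pilon_pilon 77.14 35 8 0 684 718 498950 499054 8e-44 62.0' \
  | tr ' ' '\t' > sample.tsv

# Pattern without action: print every line whose first field matches g29
matched=$(awk -F "\t" '$1 ~ /g29/' sample.tsv)
printf '%s\n' "$matched"
```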

ex 2: printing a line if the start of the hit in the target sequence (column X) is greater than XXX
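The column and the threshold are left open above. As an illustration only, the sketch below assumes that the start of the hit in the target is column 9 (s. start in the fields list) and uses an arbitrary threshold of 100000, passed to awk with -v. The file sample.tsv and the variable names min and hits are made up.

```shell
# Four rows copied from blast.output
printf '%s\n' \
  'BUSSELTON_g28320.t1 Seq_26_pilon_pilon 45.45 66 34 2 27 92 266496 266305 2e-07 57.4' \
  'BUSSELTON_g29060.t1 Seq_133_pilon_pilon 24.01 279 171 9 398 668 26316 27053 6e-13 74.7' \
  'BUSSELTON_g29060.t1 Seq_35_pilon_pilon 32.67 150 83 6 531 678 24051 23650 1e-07 57.4' \
  'BUSSELTON_g29223.t1 Seq_17_pilon_pilon 77.14 35 8 0 684 718 498950 499054 8e-44 62.0' \
  | tr ' ' '\t' > sample.tsv

# Assumed column (9) and assumed threshold (100000): numeric comparison on s. start
hits=$(awk -F "\t" -v min=100000 '{if ($9 > min) print}' sample.tsv)
printf '%s\n' "$hits"
```

With these assumed values, only the lines whose s. start is 266496 and 498950 are kept.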

How to use the if statement with 2 conditions

printing a line if the name of the query contains “” AND if the hit in the target sequence (column X) is greater than XXX
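In awk, two conditions are combined with && when both must be true. The pattern, the column and the threshold are not given above, so this sketch assumes “g29” for the query name, column 9 (s. start) for the position of the hit, and an arbitrary threshold of 100000. The file sample.tsv and the variable names are made up.

```shell
# Four rows copied from blast.output
printf '%s\n' \
  'BUSSELTON_g28320.t1 Seq_26_pilon_pilon 45.45 66 34 2 27 92 266496 266305 2e-07 57.4' \
  'BUSSELTON_g29060.t1 Seq_133_pilon_pilon 24.01 279 171 9 398 668 26316 27053 6e-13 74.7' \
  'BUSSELTON_g29060.t1 Seq_35_pilon_pilon 32.67 150 83 6 531 678 24051 23650 1e-07 57.4' \
  'BUSSELTON_g29223.t1 Seq_17_pilon_pilon 77.14 35 8 0 684 718 498950 499054 8e-44 62.0' \
  | tr ' ' '\t' > sample.tsv

# && : the line is printed only if BOTH conditions hold (assumed pattern and threshold)
both=$(awk -F "\t" -v min=100000 '{if ($1 ~ /g29/ && $9 > min) print}' sample.tsv)
printf '%s\n' "$both"
```

Only the BUSSELTON_g29223.t1 line satisfies both assumed conditions here: the first line fails the name test and the two BUSSELTON_g29060.t1 lines fail the position test.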

printing a line if the name of the query contains “” OR if the hit in the target sequence (column X) is greater than XXX
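For the OR case, awk uses ||, and a line is printed as soon as one of the two conditions is true. As the values are not given above, this sketch assumes “g29” for the query name, column 9 (s. start) for the position of the hit, and an arbitrary threshold of 300000, chosen so that one line of the sample fails both tests. The file sample.tsv and the variable names are made up.

```shell
# Four rows copied from blast.output
printf '%s\n' \
  'BUSSELTON_g28320.t1 Seq_26_pilon_pilon 45.45 66 34 2 27 92 266496 266305 2e-07 57.4' \
  'BUSSELTON_g29060.t1 Seq_133_pilon_pilon 24.01 279 171 9 398 668 26316 27053 6e-13 74.7' \
  'BUSSELTON_g29060.t1 Seq_35_pilon_pilon 32.67 150 83 6 531 678 24051 23650 1e-07 57.4' \
  'BUSSELTON_g29223.t1 Seq_17_pilon_pilon 77.14 35 8 0 684 718 498950 499054 8e-44 62.0' \
  | tr ' ' '\t' > sample.tsv

# || : the line is printed if AT LEAST ONE condition holds (assumed pattern and threshold)
either=$(awk -F "\t" -v min=300000 '{if ($1 ~ /g29/ || $9 > min) print}' sample.tsv)
printf '%s\n' "$either"
```

The BUSSELTON_g28320.t1 line is dropped because its name does not contain “g29” and its s. start (266496) is below the assumed threshold.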

How to use the if and else statements

printing the first column of a line if the query (first column) contains “”, else printing the full line
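An else branch follows the if statement after a semicolon (or on a new line). The pattern is not given above, so this sketch assumes “g29” again; sample.tsv and the variable name out are made up.

```shell
# Two rows copied from blast.output: the first lacks "g29", the second contains it
printf '%s\n' \
  'BUSSELTON_g28320.t1 Seq_26_pilon_pilon 45.45 66 34 2 27 92 266496 266305 2e-07 57.4' \
  'BUSSELTON_g29060.t1 Seq_133_pilon_pilon 24.01 279 171 9 398 668 26316 27053 6e-13 74.7' \
  | tr ' ' '\t' > sample.tsv

# if the query matches: print only column 1; else: print the whole line
out=$(awk -F "\t" '{if ($1 ~ /g29/) print $1; else print}' sample.tsv)
printf '%s\n' "$out"
```

The first line is printed in full, while the second is reduced to its query name.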

How to perform numeric operations
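No example is given for this section, so the two operations below are only suggestions of what can be done: appending the length of the hit on the target as an extra column, and computing the mean % identity over all lines. The file sample.tsv and the variable names len, sum, with_len and mean_id are made up.

```shell
# Two rows copied from blast.output
printf '%s\n' \
  'BUSSELTON_g28320.t1 Seq_26_pilon_pilon 45.45 66 34 2 27 92 266496 266305 2e-07 57.4' \
  'BUSSELTON_g29060.t1 Seq_133_pilon_pilon 24.01 279 171 9 398 668 26316 27053 6e-13 74.7' \
  | tr ' ' '\t' > sample.tsv

# Length of the hit on the target: |s. end - s. start| + 1
# (the absolute value is needed because s. start can be larger than s. end, as in the first row)
with_len=$(awk 'BEGIN {FS = OFS = "\t"} {len = $10 - $9; if (len < 0) len = -len; print $0, len + 1}' sample.tsv)
printf '%s\n' "$with_len"

# Mean % identity (column 3): accumulate on every line, divide in the END block
mean_id=$(awk -F "\t" '{sum += $3} END {if (NR > 0) printf "%.2f\n", sum / NR}' sample.tsv)
echo "mean identity: $mean_id"
```

Arithmetic on fields works directly because awk treats a field that looks like a number as a number.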
