User Tools

Site Tools


awk_for_tabulated_files

The command awk can be really useful to edit or parse tabulated files (for example: blast outputs in columns separated by a tabulation = -outfmt 6; or gff files).

By default, awk scans a file line by line, whereby a line is ending by a carriage return (\n) and further split the line into fields, by default separated by a tabulation “\t” although other field separators can be defined.

We will see how to use awk on a blast output file (-outfmt 6) named blast.output which first lines look like that:

user$ head -7 blast.output
BUSSELTON_g28320.t1	Seq_26_pilon_pilon	45.45	66	34	2	27	92	266496	266305	2e-07	57.4
BUSSELTON_g29060.t1	Seq_133_pilon_pilon	24.01	279	171	9	398	668	26316	27053	6e-13	74.7
BUSSELTON_g29060.t1	Seq_35_pilon_pilon	32.67	150	83	6	531	678	24051	23650	1e-07	57.4
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	46.67	195	79	1	1103	1272	499049	499633	9e-49	 193
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	68.89	90	27	1	594	683	498684	498950	8e-44	 137
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	77.14	35	8	0	684	718	498950	499054	8e-44	62.0
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	37.75	151	93	1	1381	1531	499664	500113	6e-23	 108
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, send, evalue, bit score

How to invert 2 columns (fields)
ex: Inverting the query (column 1) and the target (column 2)

user$ awk -F "\t" '{print $2"\t"$1"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12}' blast.output
Seq_26_pilon_pilon	BUSSELTON_g28320.t1	45.45	66	34	2	27	92	266496	266305	2e-07	57.4
Seq_133_pilon_pilon	BUSSELTON_g29060.t1	24.01	279	171	9	398	668	26316	27053	6e-13	74.7
Seq_35_pilon_pilon	BUSSELTON_g29060.t1	32.67	150	83	6	531	678	24051	23650	1e-07	57.4
Seq_17_pilon_pilon	BUSSELTON_g29223.t1	46.67	195	79	1	1103	1272	499049	499633	9e-49	 193
Seq_17_pilon_pilon	BUSSELTON_g29223.t1	68.89	90	27	1	594	683	498684	498950	8e-44	 137
Seq_17_pilon_pilon	BUSSELTON_g29223.t1	77.14	35	8	0	684	718	498950	499054	8e-44	62.0
Seq_17_pilon_pilon	BUSSELTON_g29223.t1	37.75	151	93	1	1381	1531	499664	500113	6e-23	 108
#-F "\t" is used to say that the fields in the blast.output file are separated by a tabulation

How to use the if statement
ex 1: printing a line if the name of the query (first column) contains “g29”

user$ awk -F "\t" '{if ($1 ~ /g29/) print}' blast.output
BUSSELTON_g29060.t1	Seq_133_pilon_pilon	24.01	279	171	9	398	668	26316	27053	6e-13	74.7
BUSSELTON_g29060.t1	Seq_35_pilon_pilon	32.67	150	83	6	531	678	24051	23650	1e-07	57.4
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	46.67	195	79	1	1103	1272	499049	499633	9e-49	 193
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	68.89	90	27	1	594	683	498684	498950	8e-44	 137
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	77.14	35	8	0	684	718	498950	499054	8e-44	62.0
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	37.75	151	93	1	1381	1531	499664	500113	6e-23	 108
#the query in the first line BUSSELTON_g28320.t1 is the only one that do not contains "g29" so was not printed

ex 2: printing a line if the start of the hit in the target sequence (column 9 (s. start)) is greater than 499000

$user awk -F "\t" '{if ($9 > 499000) print}' blast.output
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	46.67	195	79	1	1103	1272	499049	499633	9e-49	 193
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	37.75	151	93	1	1381	1531	499664	500113	6e-23	 108

How to use the if statement with 2 conditions
ex: printing a line if the name of the query contains “g29” AND if the the hit in the target sequence (column 9) is smaller than 499000

$user awk -F "\t" '{if (($9 < 499000) && ($1 ~ /g29/)) print}' blast.output
BUSSELTON_g29060.t1	Seq_133_pilon_pilon	24.01	279	171	9	398	668	26316	27053	6e-13	74.7
BUSSELTON_g29060.t1	Seq_35_pilon_pilon	32.67	150	83	6	531	678	24051	23650	1e-07	57.4
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	68.89	90	27	1	594	683	498684	498950	8e-44	 137
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	77.14	35	8	0	684	718	498950	499054	8e-44	62.0
#&& mean AND, both conditions must be filled

ex: printing a line if the name of the query contains “g29” OR if the the hit in the target sequence (column 9) is smaller than 499000

user$ awk -F "\t" '{if (($9 < 499000) || ($1 ~ /g29/)) print}' blast.output
BUSSELTON_g28320.t1	Seq_26_pilon_pilon	45.45	66	34	2	27	92	266496	266305	2e-07	57.4
BUSSELTON_g29060.t1	Seq_133_pilon_pilon	24.01	279	171	9	398	668	26316	27053	6e-13	74.7
BUSSELTON_g29060.t1	Seq_35_pilon_pilon	32.67	150	83	6	531	678	24051	23650	1e-07	57.4
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	46.67	195	79	1	1103	1272	499049	499633	9e-49	 193
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	68.89	90	27	1	594	683	498684	498950	8e-44	 137
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	77.14	35	8	0	684	718	498950	499054	8e-44	62.0
BUSSELTON_g29223.t1	Seq_17_pilon_pilon	37.75	151	93	1	1381	1531	499664	500113	6e-23	 108
#the 2 pipes caracters || mean OR, either conditions must be filled

How to use the if and else statments
ex: printing the first column of a line if the query (first column) contains “g29”, else print the full line

user$ awk -F "\t" '{if ($1 ~ /g29/) print $1; else print}' blast.output
BUSSELTON_g28320.t1	Seq_26_pilon_pilon	45.45	66	34	2	27	92	266496	266305	2e-07	57.4
BUSSELTON_g29060.t1
BUSSELTON_g29060.t1
BUSSELTON_g29223.t1
BUSSELTON_g29223.t1
BUSSELTON_g29223.t1
BUSSELTON_g29223.t1

How to make numeric operations on certain fields
ex 1: printing the column 2, the column 10 and the column 10 -100

user$ awk -F "\t" '{print $2"\t"$10"\t"$10-100}' blast.output
Seq_26_pilon_pilon	266305	266205
Seq_133_pilon_pilon	27053	26953
Seq_35_pilon_pilon	23650	23550
Seq_17_pilon_pilon	499633	499533
Seq_17_pilon_pilon	498950	498850
Seq_17_pilon_pilon	499054	498954
Seq_17_pilon_pilon	500113	500013

ex 2: if the column 10 is greater than the column 9, print the column 2, the column 9 -100 and the column 9; else (if the column 9 is greater than the column 10) printing the column 2, the column 10 and the column 10 +100

user$ awk -F "\t" '{if ($9 > $10) print $2"\t"$9-100"\t"$9; else print $2"\t"$10"\t"$10+100}' blast.output
Seq_26_pilon_pilon	266396	266496
Seq_133_pilon_pilon	27053	27153
Seq_35_pilon_pilon	23951	24051
Seq_17_pilon_pilon	499633	499733
Seq_17_pilon_pilon	498950	499050
Seq_17_pilon_pilon	499054	499154
Seq_17_pilon_pilon	500113	500213
awk_for_tabulated_files.txt · Last modified: by 156.34.16.174