awk_for_tabulated_files
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| awk_for_tabulated_files [2021/06/23 12:29] – 71.7.206.175 | awk_for_tabulated_files [2021/07/06 12:42] (current) – 156.34.16.174 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | The command awk can be really useful to edit or parse tabulated files (for example: blast outputs in columns separated by a tabulation = -outfmt 6; or gff files). | ||
| + | |||
| By default, awk scans a file line by line, whereby a line is ending by a carriage return (\n) and further split the line into fields, by default separated by a tabulation " | By default, awk scans a file line by line, whereby a line is ending by a carriage return (\n) and further split the line into fields, by default separated by a tabulation " | ||
| - | The command awk can be really useful to edit or parse tabulated files (for example: blast output in columns separated by a tabulation = -outfmt 6; or gff files). | ||
| - | We will see how to use awk on a blast output file (-outfmt 6) | ||
| - | How to invert 2 columns (fields) | + | We will see how to use awk on a blast output file (-outfmt 6) named blast.output which first lines look like that: |
| + | < | ||
| + | user$ head -7 blast.output | ||
| + | BUSSELTON_g28320.t1 Seq_26_pilon_pilon 45.45 66 34 2 27 92 266496 266305 2e-07 57.4 | ||
| + | BUSSELTON_g29060.t1 Seq_133_pilon_pilon 24.01 279 171 9 398 668 26316 27053 6e-13 74.7 | ||
| + | BUSSELTON_g29060.t1 Seq_35_pilon_pilon 32.67 150 83 6 531 678 24051 23650 1e-07 57.4 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 46.67 195 79 1 1103 1272 499049 499633 9e-49 193 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 68.89 90 27 1 594 683 498684 498950 8e-44 137 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 77.14 35 8 0 684 718 498950 499054 8e-44 62.0 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 37.75 151 93 1 1381 1531 499664 500113 6e-23 108 | ||
| + | # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, send, evalue, bit score | ||
| + | </ | ||
| + | |||
| + | |||
| + | How to invert 2 columns (fields) | ||
| + | ex: Inverting the query (column 1) and the target (column 2) | ||
| + | < | ||
| + | user$ awk -F " | ||
| + | Seq_26_pilon_pilon BUSSELTON_g28320.t1 45.45 66 34 2 27 92 266496 266305 2e-07 57.4 | ||
| + | Seq_133_pilon_pilon BUSSELTON_g29060.t1 24.01 279 171 9 398 668 26316 27053 6e-13 74.7 | ||
| + | Seq_35_pilon_pilon BUSSELTON_g29060.t1 32.67 150 83 6 531 678 24051 23650 1e-07 57.4 | ||
| + | Seq_17_pilon_pilon BUSSELTON_g29223.t1 46.67 195 79 1 1103 1272 499049 499633 9e-49 193 | ||
| + | Seq_17_pilon_pilon BUSSELTON_g29223.t1 68.89 90 27 1 594 683 498684 498950 8e-44 137 | ||
| + | Seq_17_pilon_pilon BUSSELTON_g29223.t1 77.14 35 8 0 684 718 498950 499054 8e-44 62.0 | ||
| + | Seq_17_pilon_pilon BUSSELTON_g29223.t1 37.75 151 93 1 1381 1531 499664 500113 6e-23 108 | ||
| + | #-F " | ||
| + | </ | ||
| + | |||
| + | |||
| + | How to use the **if** statement \\ | ||
| + | ex 1: printing a line if the name of the query (first column) contains " | ||
| + | < | ||
| + | user$ awk -F " | ||
| + | BUSSELTON_g29060.t1 Seq_133_pilon_pilon 24.01 279 171 9 398 668 26316 27053 6e-13 74.7 | ||
| + | BUSSELTON_g29060.t1 Seq_35_pilon_pilon 32.67 150 83 6 531 678 24051 23650 1e-07 57.4 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 46.67 195 79 1 1103 1272 499049 499633 9e-49 193 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 68.89 90 27 1 594 683 498684 498950 8e-44 137 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 77.14 35 8 0 684 718 498950 499054 8e-44 62.0 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 37.75 151 93 1 1381 1531 499664 500113 6e-23 108 | ||
| + | #the query in the first line BUSSELTON_g28320.t1 is the only one that do not contains " | ||
| + | </ | ||
| + | |||
| + | |||
| + | ex 2: printing a line if the start of the hit in the target sequence (column 9 (s. start)) is greater than 499000 | ||
| + | < | ||
| + | $user awk -F " | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 46.67 195 79 1 1103 1272 499049 499633 9e-49 193 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 37.75 151 93 1 1381 1531 499664 500113 6e-23 108 | ||
| + | </ | ||
| + | |||
| + | How to use the **if** statement with 2 conditions \\ | ||
| + | ex: printing a line if the name of the query contains " | ||
| + | < | ||
| + | $user awk -F " | ||
| + | BUSSELTON_g29060.t1 Seq_133_pilon_pilon 24.01 279 171 9 398 668 26316 27053 6e-13 74.7 | ||
| + | BUSSELTON_g29060.t1 Seq_35_pilon_pilon 32.67 150 83 6 531 678 24051 23650 1e-07 57.4 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 68.89 90 27 1 594 683 498684 498950 8e-44 137 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 77.14 35 8 0 684 718 498950 499054 8e-44 62.0 | ||
| + | #&& mean AND, both conditions must be filled | ||
| + | </ | ||
| + | |||
| + | |||
| + | ex: printing a line if the name of the query contains " | ||
| + | < | ||
| + | user$ awk -F " | ||
| + | BUSSELTON_g28320.t1 Seq_26_pilon_pilon 45.45 66 34 2 27 92 266496 266305 2e-07 57.4 | ||
| + | BUSSELTON_g29060.t1 Seq_133_pilon_pilon 24.01 279 171 9 398 668 26316 27053 6e-13 74.7 | ||
| + | BUSSELTON_g29060.t1 Seq_35_pilon_pilon 32.67 150 83 6 531 678 24051 23650 1e-07 57.4 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 46.67 195 79 1 1103 1272 499049 499633 9e-49 193 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 68.89 90 27 1 594 683 498684 498950 8e-44 137 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 77.14 35 8 0 684 718 498950 499054 8e-44 62.0 | ||
| + | BUSSELTON_g29223.t1 Seq_17_pilon_pilon 37.75 151 93 1 1381 1531 499664 500113 6e-23 108 | ||
| + | #the 2 pipes caracters || mean OR, either conditions must be filled | ||
| + | </ | ||
| + | |||
| + | |||
| + | How to use the if and else statments \\ | ||
| + | ex: printing the first column of a line if the query (first column) contains " | ||
| + | < | ||
| + | user$ awk -F " | ||
| + | BUSSELTON_g28320.t1 Seq_26_pilon_pilon 45.45 66 34 2 27 92 266496 266305 2e-07 57.4 | ||
| + | BUSSELTON_g29060.t1 | ||
| + | BUSSELTON_g29060.t1 | ||
| + | BUSSELTON_g29223.t1 | ||
| + | BUSSELTON_g29223.t1 | ||
| + | BUSSELTON_g29223.t1 | ||
| + | BUSSELTON_g29223.t1 | ||
| + | </ | ||
| + | |||
| + | How to make numeric operations on certain fields \\ | ||
| + | ex 1: printing the column 2, the column 10 and the column 10 -100 | ||
| + | < | ||
| + | user$ awk -F " | ||
| + | Seq_26_pilon_pilon 266305 266205 | ||
| + | Seq_133_pilon_pilon 27053 26953 | ||
| + | Seq_35_pilon_pilon 23650 23550 | ||
| + | Seq_17_pilon_pilon 499633 499533 | ||
| + | Seq_17_pilon_pilon 498950 498850 | ||
| + | Seq_17_pilon_pilon 499054 498954 | ||
| + | Seq_17_pilon_pilon 500113 500013 | ||
| + | </ | ||
| - | How to print a line if the name of the query (first column) | + | ex 2: if the column 10 is greater than the column 9, print the column 2, the column 9 -100 and the column 9; else (if the column |
| + | < | ||
| + | user$ awk -F " | ||
| + | Seq_26_pilon_pilon 266396 266496 | ||
| + | Seq_133_pilon_pilon 27053 27153 | ||
| + | Seq_35_pilon_pilon 23951 24051 | ||
| + | Seq_17_pilon_pilon 499633 499733 | ||
| + | Seq_17_pilon_pilon 498950 499050 | ||
| + | Seq_17_pilon_pilon 499054 499154 | ||
| + | Seq_17_pilon_pilon 500113 500213 | ||
| + | </ | ||
| - | How to print a line if the start of the hit in the target sequence is greater than XXX | ||
awk_for_tabulated_files.1624462165.txt.gz · Last modified: by 71.7.206.175
