--- handy_custom_functions [2021/05/12 14:20] – 168.91.18.151
+++ handy_custom_functions [2023/07/25 12:08] (current) – 134.190.232.186
@@ Line 11: / Line 11: @@
 Here I discuss some more custom functions that I find very useful in my daily workflow. To add these functions to your system, simply add them to your ''.bashrc''.
-===Selecting or removing sequences from a FASTA file===
+===Reformatting FASTA files downloaded from NCBI===
+For most of my analyses, the default header format of NCBI FASTA files is unwieldy. This function converts each header to the form ''>SpeciesName_i_AccessionNumber'' and writes the result to a new file ending in ''.hdfmt.fasta'', leaving the original untouched.
 <code>
-# fish out a sequence from a fasta file
+# format NCBI headers to something readable
-function grabseq {
+function reformat_ncbi_headers {
-    fasta=$3
+    newfile=${1%.fasta}.hdfmt.fasta
-    to_grab=$2
+    cp $1 $newfile
-    case "$1" in
+    sed -i -r -e '/^>/ s/ >.*//' -e 's/>([^ ]*).*\[(.*)\]/>\2_i_\1/' -e 's/ /_/g' $newfile
-        -s) seqtk subseq $fasta <(grep    "$to_grab" $fasta | sed -e 's/>//' -e 's/ .*//');;
+    sed -i -r -e '/^>/ s/gi.*ref\|(.*)\|/\1/' $newfile
-        -l) seqtk subseq $fasta <(grep -f "$to_grab" $fasta | sed -e 's/>//' -e 's/ .*//');;
-        *) echo -e "grabseq -s <seqname> <fasta> to grab a single sequence\ngrab_seq -l <seqlist> <fasta> to grab a list of sequences"
-    esac
 }
 </code>
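A minimal usage sketch for the header reformatting, assuming GNU sed (for ''-i'' and ''-r''). The input header and the temporary file names are made up for illustration, and the function is repeated here in the portable ''name() { ... }'' form so the snippet runs on its own.

```shell
# Same sed commands as reformat_ncbi_headers above, with quoted file names
reformat_ncbi_headers() {
    newfile=${1%.fasta}.hdfmt.fasta
    cp "$1" "$newfile"
    sed -i -r -e '/^>/ s/ >.*//' -e 's/>([^ ]*).*\[(.*)\]/>\2_i_\1/' -e 's/ /_/g' "$newfile"
    sed -i -r -e '/^>/ s/gi.*ref\|(.*)\|/\1/' "$newfile"
}

# Build a one-record FASTA file with an invented NCBI-style header
workdir=$(mktemp -d)
printf '>gi|12345|ref|XP_000001.1| hypothetical protein [Homo sapiens]\nMKVLA\n' > "$workdir/example.fasta"

reformat_ncbi_headers "$workdir/example.fasta"

# The first line is now: >Homo_sapiens_i_XP_000001.1
cat "$workdir/example.hdfmt.fasta"
```

The species name is taken from the last pair of square brackets in the header, so headers without a ''[Species name]'' part are likely to pass through the first substitution unchanged.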
-This is essentially a wrapper for the [[https://github.com/lh3/seqtk|seqtk]] tool.
+===Replacing work names with final names for publication===
-<code>
+When I analyse new genomes, transcriptomes, etc., I usually work with 'worknames' such as 'bin125', 'L4', or 'BBO'. That is fine during the analysis, but when you publish your work you will want proper names. This function takes a mapping file with the worknames in the first column and the final names in the second column, and replaces the worknames with the final names in tree files, FASTA files, you name it. The result is written to a copy of the input file with ''.nms'' appended to its name.
-# select any sequence that has <pattern> in the header
-# useful if you want to find a single sequence
-$ grabseq -s <pattern> <FASTA> > <NEWFASTA>
-# select a particular set of sequences that have <pattern> in their header and are in the <pattern_list> file
+<code>
-# useful if you want to multiple sequences
+# replace taxon names in trees, FASTA files, etc.
-$ grabseq -l <pattern_list> <FASTA> > <NEWFASTA>
+function replace_names {
+    input=$1
+    mappingfile=$2
+    cp "$input" "$input.nms"
+    cat "$mappingfile" | while read -r SEARCH REPLACE; do
+        sed -i -r "s/$SEARCH/$REPLACE/" "$input.nms"
+    done
+}
 </code>
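A short usage sketch for the name replacement, assuming GNU sed. The tree, the mapping file and the final names are invented for illustration, and the function is repeated in the portable ''name() { ... }'' form so the snippet runs on its own.

```shell
# Same logic as replace_names above: one sed substitution per mapping line
replace_names() {
    input=$1
    mappingfile=$2
    cp "$input" "$input.nms"
    while read -r SEARCH REPLACE; do
        sed -i -r "s/$SEARCH/$REPLACE/" "$input.nms"
    done < "$mappingfile"
}

workdir=$(mktemp -d)

# Hypothetical Newick tree that still uses worknames
printf '(bin125:0.1,(L4:0.2,BBO:0.3):0.05);\n' > "$workdir/tree.nwk"

# Mapping file: workname in column 1, final name in column 2
printf 'bin125 Candidatus_Examplea_one\nL4 Examplea_two\n' > "$workdir/mapping.txt"

replace_names "$workdir/tree.nwk" "$workdir/mapping.txt"

# prints (Candidatus_Examplea_one:0.1,(Examplea_two:0.2,BBO:0.3):0.05);
cat "$workdir/tree.nwk.nms"
```

Two things to keep in mind. The workname is used as a regular expression and matches substrings, so 'L4' would also hit 'L40'. And without a ''g'' flag, only the first match on each line is replaced.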
-The next function does the opposite of grabseq. It will remove particular sequences from a FASTA file.
+===Some other functions===
 <code>
-# remove a particular entry from a fasta file
+# fasta to phylip
-function rmseq {
+# depends on trimal
-    fasta=$3
+function fa2phy {
-    to_rmv=$2
+    trimal -in $1 -out ${1%.*}.phylip -phylip
-    case "$1" in
-        -s) seqtk subseq $fasta <( grep ">" $fasta | grep -E -v "$to_rmv" | sed "s/>//" );;
-        -l) seqtk subseq $fasta <( grep ">" $fasta | grep -v -f "$to_rmv" | sed "s/>//" );;
-        *) echo -e "rmseq -s <seqname> <fasta> to remove a single sequence\nremove_seq -l <seqlist> <fasta> to remove a list of sequences"
-    esac
 }
-</code>
-<code>
+# reverse complement function
-# remove a particular sequence that has <pattern> in the header
+function revcomp {
-$ rmseq -s <pattern> <FASTA> > <NEWFASTA>
+    tr "[ATGCatgcNn]" "[TACGtacgNn]" | rev
+}
-# remove a particular set of sequences that have <pattern> in the header and are in the <pattern_list> file
+# sum up all numbers in a list
-$ rmseq -l <pattern_list> <FASTA> > <NEWFASTA>
+function total {
+    # tr only reads stdin, so redirect the file given as $1 (or fall back to stdin)
+    tr '\n' '+' < "${1:-/dev/stdin}" | sed 's/+$/\n/' | bc
+}
 </code>
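A quick sketch of how the two pipe-friendly helpers might be called, assuming ''rev'' (util-linux), GNU sed and ''bc'' are installed. The helpers are repeated here in a form that reads from standard input so the snippet runs on its own; the input values are arbitrary.

```shell
# Reverse complement: complement each base with tr, then reverse the line
revcomp() {
    tr "[ATGCatgcNn]" "[TACGtacgNn]" | rev
}

# Sum a column of numbers: join the lines with '+', swap the trailing '+'
# for a newline and hand the expression to bc
total() {
    tr '\n' '+' | sed 's/+$/\n/' | bc
}

# prints NGGCAT
echo "ATGCCN" | revcomp

# prints 6
printf '1\n2\n3\n' | total
```

Because ''rev'' works line by line, ''revcomp'' is best fed one sequence per line. Piping a whole multi-line FASTA record through it would reverse each line separately, header included.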