cgeb2001's DokuWiki!


Indexing sequences

Suppose you have a dataset of gene names/IDs of interest and you want to find their homologs and paralogs. The easiest way is to BLAST them against NCBI, but wait! Where do you find the protein or nucleotide sequences of those genes? An NCBI name search could work, but what if your gene IDs are not from NCBI and you have thousands of them, e.g., AT2G01130, AT4G00010, etc., which come from the TAIR10 database (A. thaliana)? That is when you will need this resource guide. Note: it does not list every excellent resource.

Software resource:

Batch entrez
index_header_to_seq

Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez) is a web tool that makes it easy to retrieve FASTA files, especially when you need sequences for a whole batch of genes.

Prepare a file of Entrez IDs and upload it
Select the database the IDs belong to
Retrieve the results and save them in FASTA format
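
The ID file for the first step can be prepared by hand, but for long lists a small script helps avoid stray whitespace and duplicates. The following is only a minimal sketch, assuming that a plain-text file with one identifier per line is what you want to upload; the function name write_id_file and the output file name are hypothetical and not part of any NCBI tool.

```python
from pathlib import Path


def write_id_file(ids, path):
    """Write identifiers to `path`, one per line.

    Surrounding whitespace is stripped, empty entries are dropped and
    duplicates are removed while keeping the original order.
    Returns the cleaned list that was written.
    """
    seen = set()
    cleaned = []
    for raw in ids:
        ident = raw.strip()
        if ident and ident not in seen:
            seen.add(ident)
            cleaned.append(ident)
    # ASCII encoding makes Python raise an error if an ID contains
    # unexpected non-ASCII characters (assumption: IDs are plain ASCII).
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="ascii")
    return cleaned
```

For example, write_id_file(["QKK48680.1", " QKK48680.1 ", ""], "entrez_ids.txt") writes a file containing the single line QKK48680.1.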

Note: The one requirement is that you prepare a file listing Entrez accession numbers or other identifiers recognized by NCBI, e.g., QKK48680.1, which is a GenBank ID in the protein database. If you select the wrong database or supply an ID that NCBI does not recognize, you might get the following error:

An illegal character in a token. Possible wrong file format. Request processing canceled.

As mentioned at the very beginning, your gene names or IDs may have nothing to do with NCBI while you still need their FASTA sequences. A simple way to get them is the custom script index_header_to_seq.py (https://github.com/zx0223winner/HSDatabase/blob/master/index_header_to_seq.py):

python3 index_header_to_seq.py database_sequence.fasta name_list.txt out_name_listed_seq.fa 

# 'database_sequence.fasta' contains all the sequences of the species you are exploring, e.g., TAIR10 (Athaliana_167_TAIR10.fa);
# 'name_list.txt' contains the names/IDs of your genes of interest, one per line;
# 'out_name_listed_seq.fa' is the output FASTA file containing the sequences you need.
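
To illustrate the idea behind this kind of lookup, here is a minimal sketch written for this page. It is not the code of index_header_to_seq.py, and its matching rule is an assumption: a record is kept when the first whitespace-delimited token of its header, or that token with a trailing ".N" suffix removed (e.g., AT2G01130.1 becomes AT2G01130), appears in the name list. Matching is case-sensitive, and the function names are hypothetical.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs; header is the text after '>'."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def extract_by_name(fasta_path, names_path, out_path):
    """Copy records whose ID is in the name list; return names not found."""
    with open(names_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}
    found = set()
    with open(out_path, "w") as out:
        for header, seq in read_fasta(fasta_path):
            tokens = header.split()
            seq_id = tokens[0] if tokens else ""
            # Assumption: a trailing ".N" marks a gene model/version suffix.
            gene_id = seq_id.rsplit(".", 1)[0]
            if seq_id in wanted:
                hit = seq_id
            elif gene_id in wanted:
                hit = gene_id
            else:
                continue
            # Each sequence is written on a single line (no wrapping).
            out.write(f">{header}\n{seq}\n")
            found.add(hit)
    return wanted - found
```

Calling extract_by_name("database_sequence.fasta", "name_list.txt", "out_name_listed_seq.fa") returns the set of names that had no match, which is handy for spotting typos or IDs that differ only in letter case.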

Now feel free to explore your genes of interest with BLAST+; please refer to the BLAST+ and v5 database user guide (http://129.173.88.134:81/dokuwiki/doku.php?id=blast_protocol).

Creating alignments, trimming alignments, building the trees.

Software resource:

Clustal Omega 1.2.3
trimAl v1.2

Clustal Omega 1.2.3 (http://www.clustal.org/omega/)
trimAl v1.2 (http://trimal.cgenomics.org/trimal), tutorial: http://trimal.cgenomics.org/_media/tutorial.pdf
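
As a rough illustration of how the two tools can be chained, the sketch below builds the command lines for an alignment followed by trimming and runs them with subprocess. It assumes the executables are installed as clustalo and trimal on your PATH; the file names, the function names, and the choice of trimAl's -automated1 mode are assumptions made for this example, not recommendations from this guide.

```python
import shlex
import subprocess


def clustalo_cmd(in_fasta, out_aln):
    """Clustal Omega: align in_fasta and write a FASTA-format alignment."""
    # -i input, -o output, --outfmt output format, --force overwrite output
    return ["clustalo", "-i", in_fasta, "-o", out_aln,
            "--outfmt=fasta", "--force"]


def trimal_cmd(in_aln, out_aln):
    """trimAl: trim an alignment using the -automated1 heuristic."""
    return ["trimal", "-in", in_aln, "-out", out_aln, "-automated1"]


def align_and_trim(in_fasta, aln_path, trimmed_path):
    """Run both steps in order; raises CalledProcessError on failure."""
    for cmd in (clustalo_cmd(in_fasta, aln_path),
                trimal_cmd(aln_path, trimmed_path)):
        print("Running:", shlex.join(cmd))
        subprocess.run(cmd, check=True)
```

For example, align_and_trim("out_name_listed_seq.fa", "genes.aln.fa", "genes.trimmed.fa") would align the sequences retrieved earlier and write the trimmed alignment to the last file.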

Building trees

Calculating dN/dS

Note: Please refer to the guide for the most up-to-date information.