phylogeny_protocol3
Differences
This shows you the differences between two versions of the page.
| phylogeny_protocol3 [2021/10/09 17:01] – 134.190.232.9 | phylogeny_protocol3 [2021/10/10 23:01] (current) – 134.190.232.9 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | Here is an example of how to run it for protein IDs (-p):\\ | + | In this part, we work out how to obtain the taxonomy for the genes in the MMETSP and NCBI databases. First, the usage of acc2tax is shown here: |
| + | **1. Usage of Acc2tax** | ||
| + | < | ||
| + | #acc2tax https:// | ||
| + | # Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each. | ||
| + | # http:// | ||
| + | |||
| + | #Input file | ||
| + | CDH53707 | ||
| + | XP_011133305 | ||
| + | KMU79707 | ||
| + | XP_002963095 | ||
| + | |||
| + | # Output file | ||
| + | CDH53707 cellular organisms, | ||
| + | XP_011133305 cellular organisms, | ||
| + | KMU79707 cellular organisms, | ||
| + | </ | ||
| + | |||
| + | < | ||
| | | ||
| + | </ | ||
| - | don't forget to make sure that your input file contains only the accession numbers without their version, see the example file given above. | + | Note: make sure that your input file contains only accession numbers without their version suffix; see the example input file given above. |
| - | :-( E.g., MBI4782295.1 shall be MBI4782295, otherwise the bugs will occur: | + | E.g., MBI4782295.1 must be written as MBI4782295; otherwise errors will occur: |
| < | < | ||
| Line 13: | Line 33: | ||
| </ | </ | ||
| - | :-D Trim the version " | + | Trim the version " |
| < | < | ||
| Line 35: | Line 55: | ||
| </ | </ | ||
| - | ========================== | + | < |
| - | acc2tax | + | # acc2tax database directory: |
| + | / | ||
| + | </ | ||
| + | |||
| + | |||
| + | **2. MMETSP hierarchical taxonomic info** | ||
| + | |||
| + | < | ||
| + | # A new MMETSP database containing the taxonomy information was used. | ||
| + | dir: >/ | ||
| + | fasta: > | ||
| + | HHYGDSHFBSJBSCJSJKCHSFBSMCNSBCMBSM | ||
| + | |||
| + | # VS | ||
| + | dir: >/ | ||
| + | fasta: > | ||
| + | SDFHSJFBSNVMSNVMSBHVDBCDMSNCSKFNB | ||
| + | </ | ||
| + | |||
| + | * Reducing the redundancy of MMETSP and NCBI-nr with CD-HIT | ||
| + | |||
| + | < | ||
| + | > cd-hit-est -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c 0.8 -n 10 | ||
| + | # Note: cd-hit-est is the nucleotide variant of CD-HIT; its user guide suggests a word length of -n 5 for identity thresholds of 0.80 to 0.85. | ||
| + | |||
| + | # CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. | ||
| + | |||
| + | # -c: sequence identity threshold, default 0.9 | ||
| + | this is the default cd-hit' | ||
| + | | ||
| + | | ||
| + | # | ||
| + | </ | ||
| + | |||
| + | **3. Trivia** | ||
| + | |||
| + | < | ||
| + | # acquire the taxon info. | ||
| + | >grep ' | ||
| + | |||
| + | # acquire the length of the fasta file. | ||
| + | awk '/ | ||
| + | </ | ||
| + | |||
| + | < | ||
| + | # MMETSP DATASET | ||
| + | CP_0198131824_1486917_Craspedostauros_australis_CCMP3328 | ||
| + | |||
| + | CP_0198725784__Unidentified_sp_CCMP1999 | ||
| + | |||
| + | CP_0202964930__Pseudokeronopsis_sp_Brazil | ||
| + | |||
| + | CP_0174260800_1078864_Stereomyxa_ramosa_Chinc5 (0) | ||
| + | |||
| + | # If for some reason the source organism cannot be mapped to the taxonomy | ||
| + | | ||
| + | </ | ||
| + | |||
| + | * Links | ||
| + | |||
| + | < | ||
| + | https:// | ||
| + | https:// | ||
| + | |||
| + | https:// | ||
| + | names.dmp, nodes.dmp | ||
| + | |||
| + | # JGI Taxonomy Guide | ||
| + | |||
| + | https:// | ||
| + | wget ftp:// | ||
| + | gi_taxid_nucl.dmp | ||
| + | </ | ||
| + | |||
| + | * Directories | ||
| + | |||
| + | < | ||
| + | / | ||
| + | / | ||
| + | / | ||
| + | |||
| + | nohup zcat dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz |sort > nucl_all.txt | ||
| + | nohup zcat dead_prot.accession2taxid.gz prot.accession2taxid.gz | sort > prot_all.txt | ||
| + | |||
| + | </ | ||
| + | |||
| + | * What is the difference between the GenPept format and GenPept (full)? | ||
| + | |||
| + | < | ||
| + | Full | ||
| + | Accession.version taxid | ||
| + | 0308206A 8058 | ||
| + | 0308221A 9606 | ||
| + | 0308230A 1049 | ||
| + | |||
| + | accession accession.version taxid gi | ||
| + | A0A009IHW8 A0A009IHW8.1 1310613 1835922267 | ||
| + | A0A023FBW4 A0A023FBW4.1 34607 1939884164 | ||
| + | A0A023FBW7 A0A023FBW7.1 34607 1939884197 | ||
| + | </ | ||
| + | |||
「: Shift + [ on the Pinyin keyboard
| - | / | + | 」: Shift + ] on the Pinyin keyboard |
| - | /misc/ | + | # : command +/ |
| - | \\ | + | |
| <Last updated by Xi Zhang on Oct 9th, | <Last updated by Xi Zhang on Oct 9th, | ||
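
The note in section 1 asks for accessions without their version (MBI4782295.1 becomes MBI4782295). The trimming command on this page is cut off, so the following is only a minimal sketch of one way to do it, not necessarily the command used originally. The function name `strip_version` and the file names in the usage comment are hypothetical; the only assumption about the input is one accession per line, with an optional trailing `.<digits>` version.

```shell
#!/bin/sh
# Hypothetical helper (not part of acc2tax): remove a trailing ".<digits>"
# version from each accession read on stdin, one accession per line.
# Lines without a version suffix pass through unchanged.
strip_version() {
    sed -E 's/\.[0-9]+$//'
}

# Small demonstration on inline data, so nothing on disk is required.
printf 'MBI4782295.1\nCDH53707\nXP_011133305.2\n' | strip_version

# Typical use with files (both file names are placeholders):
#   strip_version < ids_with_versions.txt > ids_no_versions.txt
```

Only a suffix made of digits at the very end of the line is removed, so accessions that already lack a version, such as CDH53707, are left as they are.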
phylogeny_protocol3.1633809690.txt.gz · Last modified: by 134.190.232.9
