phylogeny_protocol3
Differences
This shows you the differences between two versions of the page.
| phylogeny_protocol3 [2021/10/10 15:12] – 134.190.232.9 | phylogeny_protocol3 [2021/10/10 23:01] (current) – 134.190.232.9 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | 「Ongoing」 Here is an example of how to run it for protein IDs (-p):\\ | + | In this part, we work out how to obtain the taxonomy for the genes in the MMETSP and NCBI databases. First, the usage of the acc2tax software is shown here: |
| + | **1. Usage of Acc2tax** | ||
| + | < | ||
| + | #acc2tax https:// | ||
| + | # Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each. | ||
| + | # http:// | ||
| + | |||
| + | #Input file | ||
| + | CDH53707 | ||
| + | XP_011133305 | ||
| + | KMU79707 | ||
| + | XP_002963095 | ||
| + | |||
| + | #Output file | ||
| + | CDH53707 cellular organisms, | ||
| + | XP_011133305 cellular organisms, | ||
| + | KMU79707 cellular organisms, | ||
| + | </ | ||
| + | |||
| + | < | ||
| | | ||
| + | </ | ||
| - | don't forget to make sure that your input file contains only the accession numbers without their version, see the example file given above. | + | Note: make sure that your input file contains only accession numbers without their version suffix; see the example file given above. |
| - | :-( E.g., MBI4782295.1 shall be MBI4782295, otherwise the bugs will occur: | + | E.g., MBI4782295.1 should be MBI4782295; otherwise errors will occur: |
| < | < | ||
| Line 13: | Line 33: | ||
| </ | </ | ||
| - | :-D Trim the version " | + | Trim the version " |
| < | < | ||
| Line 35: | Line 55: | ||
| </ | </ | ||
| - | ========================== | + | < |
| - | acc2tax database:\\ | + | # acc2tax database |
| - | + | ||
| / | / | ||
| - | \\ | + | </ |
| - | MMETSP | + | **2. MMETSP |
| - | 7. New MMETSP database was used containing the taxonomy information. | + | < |
| + | #New MMETSP database was used containing the taxonomy information. | ||
| + | dir: >/ | ||
| + | fasta: > | ||
| + | HHYGDSHFBSJBSCJSJKCHSFBSMCNSBCMBSM | ||
| - | >/ | ||
| - | > | ||
| # VS | # VS | ||
| - | >/ | + | dir: >/ |
| + | fasta: > | ||
| + | SDFHSJFBSNVMSNVMSBHVDBCDMSNCSKFNB | ||
| + | </ | ||
| - | > | + | * Reducing the redundancy of MMETSP and NCBI-nr. CD-HIT |
| - | + | ||
| - | #The directory of shared folder in Perun: | + | |
| - | # The BLAST -max_target_seqs option does work on yielding all the hits for each Clp in MMETSP and NCBI-Nr. | + | |
| - | + | ||
| - | + | ||
| - | 8. Reducing the redundancy of MMETSP and NCBI-nr. CD-HIT | + | |
| - | + | ||
| - | You are right, I am working on reducing the redundancy of MMETSP. Bruce suggested a tool (CD-HIT) which works well for reducing the redundancy of the cases you mentioned in MMETSP. Taking AT5G15450.1 as an example, the number of MMETSP BLAST search hits was reduced from 502 to ~150. I then took a first look at which species match AT5G15450.1 (E value < | + | |
| + | < | ||
| > cd-hit-est -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c 0.8 -n 10 | > cd-hit-est -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c 0.8 -n 10 | ||
| - | • CD-HIT | + | #CD-HIT |
| - | CD-HIT is a widely used program for clustering biological sequences to reduce | + | #sequence identity threshold, default 0.9 |
| - | + | ||
| - | -csequence | + | |
| this is the default cd-hit' | this is the default cd-hit' | ||
| | | ||
| | | ||
| - | + | #nword_length, | |
| - | -nword_length, | + | </ |
| + | **3. Trivia** | ||
| - | 9. acc2tax https:// | + | < |
| - | # Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each. | + | |
| - | # http:// | + | |
| - | + | ||
| - | #Input file | + | |
| - | CDH53707 | + | |
| - | XP_011133305 | + | |
| - | KMU79707 | + | |
| - | XP_002963095 | + | |
| - | + | ||
| - | # | + | |
| - | CDH53707 cellular organisms, | + | |
| - | XP_011133305 cellular organisms, | + | |
| - | KMU79707 cellular organisms, | + | |
| - | + | ||
| - | + | ||
| - | 10. Multi-gene phylogeny tree | + | |
| - | http:// | + | |
| - | + | ||
| - | Manually: (i.e. taxon has an unexpected placement on the tree) or that generate extra-long branches. | + | |
| - | Operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals. | + | |
| # acquire the taxon info. | # acquire the taxon info. | ||
| - | >grep ' | + | >grep ' |
| #acquire the length of fasta file. | #acquire the length of fasta file. | ||
| - | awk '/ | + | awk '/ |
| - | + | </ | |
| - | 11. Refining the trees | + | |
| + | < | ||
| + | # MMETSP DATASET | ||
| CP_0198131824_1486917_Craspedostauros_australis_CCMP3328 | CP_0198131824_1486917_Craspedostauros_australis_CCMP3328 | ||
| Line 116: | Line 109: | ||
| CP_0174260800_1078864_Stereomyxa_ramosa_Chinc5 (0) | CP_0174260800_1078864_Stereomyxa_ramosa_Chinc5 (0) | ||
| - | If for some reason the source organism cannot be mapped to the taxonomy | + | # If for some reason the source organism cannot be mapped to the taxonomy |
| database, the column will contain 0. | database, the column will contain 0. | ||
| + | </ | ||
| + | * Links | ||
| - | NCBI-NR | + | < |
| - | + | ||
| - | 12. Taxon in NCBI-nr (As of Sep 1st.) | + | |
| - | + | ||
| - | • | + | |
| - | grep ' | + | |
| - | grep " | + | |
| - | + | ||
| - | • Links | + | |
| https:// | https:// | ||
| https:// | https:// | ||
| Line 135: | Line 122: | ||
| names.dmp, node.dmp | names.dmp, node.dmp | ||
| - | • JGI Taxonomy Guide | + | # JGI Taxonomy Guide |
| https:// | https:// | ||
| wget ftp:// | wget ftp:// | ||
| gi_taxid_nucl.dmp | gi_taxid_nucl.dmp | ||
| + | </ | ||
| - | + | * Directories | |
| - | • Directories | + | |
| + | < | ||
| / | / | ||
| / | / | ||
| Line 149: | Line 137: | ||
| nohup zcat dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz |sort > nucl_all.txt | nohup zcat dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz |sort > nucl_all.txt | ||
| - | |||
| nohup zcat dead_prot.accession2taxid.gz prot.accession2taxid.gz | sort > prot_all.txt | nohup zcat dead_prot.accession2taxid.gz prot.accession2taxid.gz | sort > prot_all.txt | ||
| + | </ | ||
| - | • acc2tax | + | * What is the difference between the GenPept format and GenPept (full)? |
| - | | + | |
| - | Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each. | + | < |
| - | Lookup for Genbank IDs is quicker than for accessions, as the lookup table is stored in RAM (though this does mean it takes a couple of minutes to load). For accessions, the lookup is from disc. | + | |
| - | + | ||
| - | Provide batch taxonomy information for Genbank IDs or Accessions. | + | |
| - | Options: | + | |
| - | [-h | --help] | + | |
| - | [-a | --accession] | + | |
| - | [-c | --column] | + | |
| - | [-d | --database] | + | |
| - | [-e | --entries] | + | |
| - | [-g | --gi] Query is Genbank IDs. | + | |
| - | [-i | --input] | + | |
| - | [-k | --keep] | + | |
| - | [-n | --nucleotide] Query IDs are nucleotide [default]. | + | |
| - | [-o | --output] | + | |
| - | [-p | --protein] | + | |
| - | [-s | --strip] | + | |
| - | + | ||
| - | + | ||
| - | + | ||
| - | • What is the difference between the GenPept format and GenPept (full)? | + |
| Full | Full | ||
| Accession.version taxid | Accession.version taxid | ||
| Line 187: | Line 154: | ||
| A0A023FBW4 A0A023FBW4.1 34607 1939884164 | A0A023FBW4 A0A023FBW4.1 34607 1939884164 | ||
| A0A023FBW7 A0A023FBW7.1 34607 1939884197 | A0A023FBW7 A0A023FBW7.1 34607 1939884197 | ||
| + | </ | ||
| 「 :shift + [ at PinYin keyboard | 「 :shift + [ at PinYin keyboard | ||
| + | |||
| 」: shift + ] at PinYin keyboard | 」: shift + ] at PinYin keyboard | ||
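
The note above about accession versions (MBI4782295.1 should be MBI4782295 before running acc2tax) can be sketched in shell. This is only one possible way to do it and is not necessarily the command used on the page; the function name `strip_versions` is hypothetical, and the sketch assumes one accession per line with the version as a trailing `.<digits>` suffix.

```shell
# Hypothetical helper: remove a trailing ".<digits>" version suffix from
# each accession read on stdin (one accession per line).
# Accessions without a version suffix pass through unchanged.
strip_versions() {
    sed -E 's/\.[0-9]+$//'
}

# Example: versioned and unversioned accessions mixed together.
printf 'MBI4782295.1\nXP_011133305.2\nCDH53707\n' | strip_versions
# prints:
# MBI4782295
# XP_011133305
# CDH53707
```

In practice the function would be applied to the accession list before it is passed to acc2tax, e.g. `strip_versions < accessions_with_versions.txt > accessions.txt`, where both file names are placeholders.
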
phylogeny_protocol3.1633889570.txt.gz · Last modified: by 134.190.232.9
