cgeb2001's DokuWiki!

Don’t waste it: tidy up your bioinformatics work into appropriate publications

I was writing late one night in my apartment, kept indoors by the common fear outside: a new variant of the virus. Bang! Bang! (Running sounds from the floor above.) Wow! Wow! (Yelling from downstairs.) Obviously, I was not the only person working from home. The “gym” and the “night bar” had also moved into the building to offer their “services”. With practiced ease, I took a sheet of A4 paper out of a drawer and followed each step, laid out as a pipeline. Within a few minutes, the midnight noise had quieted down, and the A4 sheet titled “Protocol for non-nature noise control in nocturnal animal” was safely put back.

The fast-evolving pandemic did bring challenges for a friendly and quiet neighborhood, especially as people spending more time at home had more chances to disturb others’ rest. However, few publications walk through the specific steps of a noise complaint, let alone include a troubleshooting section. A single Google search mostly suggests contacting the building manager, calling the police, or even knocking back at the noise. I am not trying to advertise my “noise control A4 paper” (which did work, though). But as a bioinformatician who has been involved in many projects, I believe there are better ways to share mid-step work: which method you chose, how you carried out the analysis, and what to do if the outcome is unexpected.

Bioinformatics projects produce many challenges. Overcoming these challenges means progress, and there is surely a bonus to be earned along the way. During my PhD I was involved in a genome sequencing project for an Antarctic green alga, which was wrapped up as a standard genome paper covering sequencing, assembly, and annotation. However, the real exploration is not limited to the genome data itself. Many mid-step efforts were turned into bioinformatics publications. For example, when I was exploring gene duplication in Chlamydomonas genomes, it was challenging to analyze the all-against-all protein BLAST results, especially because I wanted to keep only those duplicates with near-identical protein lengths (within a certain number of amino acids) and a certain pairwise identity. Therefore, to extensively identify, categorize, and visualize highly similar duplicates (HSDs) with high accuracy and reliability, I developed a web-based tool, HSDFinder [#]. To share this mid-step bioinformatics work, I first wrapped up a hands-on HSDFinder protocol article to assist other researchers carrying out similar analyses. Then, encouraged by academic peers, I detailed the principle, simplified the workflow, and finally published a software article of nearly 7,000 words [#]. With the web tool, users can choose different parameters (from 30% to 100% identity and a length variance of 0 to 100 aa) for identifying HSDs. What’s more, I was curious about the number of HSDs in other chlorophyte algae and whether trends could be found among different species. The predicted results were then documented in HSDatabase [#], which contains a total of 42,884 HSDs from fifteen eukaryotes so far. Although it took me a lot of effort to finish the genome project paper, I benefited in the long run.
However, had I decided to throw away the mid-step analyses, I doubt I would one day remember every specific step, and I would have lost the willingness to draft the relevant publications afterwards.
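
To make the filtering idea above more concrete, here is a minimal Python sketch of how all-against-all BLAST hits might be reduced to pairs with similar lengths and high identity. It is only an illustration, not the actual HSDFinder implementation: the function name `filter_similar_pairs`, the default thresholds, and the toy data are all assumptions. It also assumes tab-separated BLAST output whose first three columns are query ID, subject ID, and percent identity, plus a separately prepared dictionary of protein lengths.

```python
def filter_similar_pairs(blast_rows, lengths, min_identity=90.0, max_len_diff=10):
    """Return sorted, de-duplicated protein pairs that pass both thresholds.

    blast_rows: iterable of tab-separated lines
                (query id, subject id, percent identity, ...).
    lengths:    dict mapping each protein id to its length in amino acids.
    The names and default thresholds are hypothetical, chosen for illustration.
    """
    pairs = set()
    for line in blast_rows:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0].startswith("#"):
            continue  # skip blank, malformed, or comment lines
        query, subject, identity = fields[0], fields[1], float(fields[2])
        if query == subject:
            continue  # a protein always matches itself
        if identity < min_identity:
            continue  # pairwise identity too low
        if abs(lengths[query] - lengths[subject]) > max_len_diff:
            continue  # protein lengths differ too much
        # A-vs-B and B-vs-A describe the same pair, so store it once
        pairs.add(tuple(sorted((query, subject))))
    return sorted(pairs)


# Toy data, made up for illustration only
toy_lengths = {"g1": 300, "g2": 305, "g3": 450, "g4": 298}
toy_rows = [
    "g1\tg1\t100.0",  # self hit, ignored
    "g1\tg2\t95.2",   # passes both thresholds
    "g2\tg1\t95.2",   # reciprocal of the hit above
    "g1\tg3\t92.0",   # identity is fine, but lengths differ by 150 aa
    "g1\tg4\t60.0",   # lengths are close, but identity is too low
]

print(filter_similar_pairs(toy_rows, toy_lengths))  # [('g1', 'g2')]
```

Loosening the thresholds, for example `min_identity=50.0`, would also keep the `g1`/`g4` pair, which mirrors the way adjustable identity and length-variance parameters change the set of reported duplicates.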

Similar things happened in other projects as well. Functional annotation of protein-coding genes can be annoying when taking the best BLAST hits from non-redundant protein sequence databases such as NCBI NR, SwissProt [#] and TrEMBL [#], because hypothetical and uncharacterized proteins may pop up at the top of the hit list. Therefore, I wrapped up a hands-on protocol paper, NoBadWordsCombiner v1.0 [#], to demonstrate how to automatically merge the BLAST results from the eukaryotic databases. More importantly, it can strengthen gene definitions by minimizing the protein function descriptions that contain ‘bad words’, such as “hypothetical” and “uncharacterized”. In a multi-species phylogeny project, I wrapped up my mid-step bioinformatics work into a hands-on protocol, TreeTuner [#], which is helpful for researchers who want to explore tree diversity and then carry out a more rigorous downstream re-analysis of specific OTUs.
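
The ‘bad words’ idea can be sketched in a few lines of Python. This is not the NoBadWordsCombiner code itself, only a minimal illustration under stated assumptions: the names `BAD_WORDS`, `is_informative`, and `pick_description` are hypothetical, each hit is assumed to be a `(description, bit score)` pair, and comparing bit scores across databases is a simplification.

```python
# Words that mark an uninformative description; the two words come from the
# text above, and a real list would likely be longer (assumption).
BAD_WORDS = ("hypothetical", "uncharacterized")


def is_informative(description, bad_words=BAD_WORDS):
    """True if the description contains none of the bad words."""
    text = description.lower()
    return not any(word in text for word in bad_words)


def pick_description(hits_by_db, bad_words=BAD_WORDS):
    """Merge hits from several databases and pick one description for a gene.

    hits_by_db: dict mapping a database name to a list of
                (description, bit_score) tuples for the same gene.
    Returns the highest-scoring informative description, falls back to the
    overall top hit if every description contains a bad word, and returns
    None if there are no hits at all.
    """
    merged = [hit for hits in hits_by_db.values() for hit in hits]
    if not merged:
        return None
    ranked = sorted(merged, key=lambda hit: hit[1], reverse=True)
    for description, _score in ranked:
        if is_informative(description, bad_words):
            return description
    return ranked[0][0]


# Toy hits, made up for illustration only
toy_hits = {
    "NR": [("hypothetical protein", 500.0), ("heat shock protein 70", 480.0)],
    "SwissProt": [("Uncharacterized protein", 450.0)],
}

print(pick_description(toy_hits))  # heat shock protein 70
```

The fallback to the top hit is a deliberate choice in this sketch: a gene whose hits are all “hypothetical” still keeps an annotation rather than silently losing it.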

Nowadays, answers to many bioinformatics questions can be found on websites such as Biostars, GitHub, and Stack Overflow. This is much like searching Google, Reddit, or other discussion platforms for advice on noise complaints. However, will Google Scholar be ready for more mid-step bioinformatics papers? Will it become a trend to see more polished downstream bioinformatics papers, such as InterProScan_parser, KEGG_decipher or NCBI_explorer? I don’t know the answers, but the author of this paper is surely looking for ways to publish the “noise control A4 paper”.