Large-scale sequence analysis is a complex task that involves integrating results from numerous computational tools. For high-throughput data analysis, these tools must be tied together in a coordinated system that can automate the execution of a set of analyses in sequence or in parallel. To meet these challenges, we have created Pegasys, a flexible, modular and customizable software system that facilitates the execution of, and data integration from, heterogeneous biological sequence analysis tools. The system includes numerous tools for pairwise and multiple sequence alignment, ab initio gene prediction, RNA gene detection, and masking of repetitive sequences in genomic DNA, as well as filters for database formatting and for processing raw output from various analyses. We introduce a novel data structure for creating workflows of sequence analyses, which allows the output of one analysis to be used as input to a subsequent analysis.
The software allows users to create analysis workflows dynamically at run-time by manipulating a graphical user interface (see below). Workflows can be saved and re-opened for future work, or distributed as protocols that encode a methodology for a specific type of analysis. When the workflow is complete, it is sent to the Pegasys server, which executes the analyses and integrates the results. Results are presented to the user in GFF or GAME XML format for integrated results, or as plain-text unprocessed output from the individual analyses in the workflow.
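
To illustrate how outputs of one analysis can feed a subsequent analysis, the following is a minimal sketch that assumes a workflow can be modeled as a directed acyclic graph of analysis steps. It is not the Pegasys data structure or API: the class `Workflow`, its methods, and the step names are hypothetical, and the lambdas are toy stand-ins for real tools such as a repeat masker or a gene predictor.

```python
from collections import deque


class Workflow:
    """Hypothetical workflow: a directed acyclic graph of analysis steps."""

    def __init__(self):
        self.steps = {}   # step name -> callable taking a list of upstream outputs
        self.inputs = {}  # step name -> names of the steps whose outputs it consumes

    def add_step(self, name, func, inputs=()):
        self.steps[name] = func
        self.inputs[name] = list(inputs)

    def execution_order(self):
        """Order steps so each one runs after all of its inputs (Kahn's algorithm)."""
        pending = {name: len(deps) for name, deps in self.inputs.items()}
        consumers = {name: [] for name in self.steps}
        for name, deps in self.inputs.items():
            for dep in deps:
                if dep not in self.steps:
                    raise ValueError(f"unknown input step: {dep}")
                consumers[dep].append(name)
        ready = deque(name for name, count in pending.items() if count == 0)
        order = []
        while ready:
            name = ready.popleft()
            order.append(name)
            for consumer in consumers[name]:
                pending[consumer] -= 1
                if pending[consumer] == 0:
                    ready.append(consumer)
        if len(order) != len(self.steps):
            raise ValueError("workflow contains a cycle")
        return order

    def run(self):
        """Execute every step, passing each one the outputs of its input steps."""
        results = {}
        for name in self.execution_order():
            upstream = [results[dep] for dep in self.inputs[name]]
            results[name] = self.steps[name](upstream)
        return results


# Toy stand-ins for real analyses (assumptions for illustration only).
wf = Workflow()
wf.add_step("sequence", lambda _: "ACGTTTTTACGT")
wf.add_step("mask_repeats",
            lambda up: up[0].replace("TTTTT", "NNNNN"), inputs=["sequence"])
wf.add_step("gene_prediction",
            lambda up: f"predicted on {up[0]}", inputs=["mask_repeats"])
wf.add_step("alignment",
            lambda up: f"aligned {up[0]}", inputs=["mask_repeats"])
wf.add_step("integrate",
            lambda up: sorted(up), inputs=["gene_prediction", "alignment"])
results = wf.run()
```

In this sketch the gene prediction and alignment steps both depend only on the masked sequence, so a scheduler could run them in parallel, while the integration step must wait for both. Serializing the `inputs` mapping together with the tool names would be one way to save such a workflow and redistribute it as a protocol.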