
CHAPTER 3

The All-Species Living Tree Project

Pablo Yarza*,1,2, Raul Munoz†,2
*Ribocon GmbH, Bremen, Germany
†Marine Microbiology Group, Department of Ecology and Marine Resources, Institut Mediterrani d’Estudis Avançats (CSIC-UIB), Esporles, Illes Balears, Spain
1 Corresponding author: e-mail address: [email protected]

1 INTRODUCTION

Data acquisition, data processing and scientific developments are leading to a rapid accumulation of digital information for microbiological purposes. The fact that the 16S rRNA gene sequence repository has grown exponentially since the early 1990s, currently exceeding 3,500,000 entries, is a good example of this tendency (Figure 1). In recent times, the study of microbiomes has enhanced activities related to the classification and identification of microorganisms, boosting the number of sequence submissions to public databases of the International Nucleotide Sequence Database Collaboration (INSDC). This ever-increasing accumulation of information has promoted the active networking of microbiologists and computer scientists leading to the development of tools and infrastructure to store, access and analyse data.

The classification of Archaea and Bacteria provides a sound foundation for microbiology. It has a major application in all information databases with respect to the need to handle valid taxon identifiers. Users and providers of taxonomic information (i.e. classification and nomenclature) constitute a large community ranging from individual researchers (e.g. taxonomists) to computerized systems (e.g. DNA sequence repositories). For practical reasons related to informative content and availability, the current classification of Archaea and Bacteria is based on genealogical patterns inferred from comparative analyses of 16S rRNA genes (Ludwig, Glöckner, & Yilmaz, 2011). Nevertheless, a shift in the general standards for prokaryotic classification is expected when the catalogue of complete genomes becomes sufficiently comprehensive (Klenk & Göker, 2010; Richter & Rosselló-Móra, 2009), though this is not yet the case (Figure 1). The often incomplete or absent taxonomic information attached to the gene sequence (e.g. organism name) has propagated from primary nucleotide repositories (i.e. INSDC), highlighting the need for curation.

2 For the LTP consortium.

Methods in Microbiology, Volume 41, ISSN 0580-9517, http://dx.doi.org/10.1016/bs.mim.2014.07.006 © 2014 Elsevier Ltd. All rights reserved.


CHAPTER 3 Living Tree Project (LTP)

FIGURE 1 Annual cumulative growth of databases. 16S ribosomal RNA (light grey, to the left axis) was obtained from SILVA release 114 (http://www.arb-silva.de). Genome projects (black, to the right axis) refer to complete genome projects on Archaea and Bacteria according to the GOLD database (http://www.genomesonline.org). Names of prokaryotic species and subspecies with standing in nomenclature since 1980 (dark grey, to the right axis) were obtained from LPSN (http://www.bacterio.net). The total number of names is around 2000 greater than the total number of distinct type strains due to homotypic synonyms, new combinations, nomina nova and later heterotypic synonyms, or illegitimate names.

The ‘All-Species’ Living Tree Project (LTP) started in 2007, given the demand for a reliable taxonomic resource based on rRNA gene sequence data (Yarza et al., 2008). The LTP is an initiative coordinated by the executive editors of the journal Systematic and Applied Microbiology for the development of a taxonomic tool encompassing updated sequence databases, alignments and phylogenetic trees of the type strains of hitherto described species of Archaea and Bacteria. The LTP counted on the help of four additional partners to set up an informatics infrastructure and a training environment for curators. These partners are (1) LPSN (Euzéby, 1997; www.bacterio.net) providing support on taxonomic nomenclature; (2) ARB (Ludwig et al., 2004; www.arb-home.de) for support on phylogenetics and classification; (3) SILVA (Quast et al., 2013; www.arb-silva.de) supplying sequence databases, computational resources and Web hosting; and (4) Ribocon (www.ribocon.com) for training and database management.

2 SOURCES OF INFORMATION

2.1 CLASSIFICATION OF MICROBIAL DATABASES

Overall, the multiple kinds of data together with the distinct requirements of particular users have defined a network through which information flows, gaining specificity and integration (Figure 2). The process happens among collaborating microbial databases which can be classified into three categories according to the quantity and refinement of their data.

At the first level, big infrastructures exist to preserve the data generated by three main activities in microbiology: (a) description of microorganisms and microbial communities, (b) sequencing of strains and (c) their preservation in Biological Resource Centres (BRCs) (Figure 2). The INSDC-member databases, for example, can be regarded as primary repositories of sequence data.

Secondary infrastructures have arisen to provide high-quality, specific software platforms and databases, for example, rRNA gene databases like SILVA (Figure 2). One of the main aims of these rRNA resources is to continuously upgrade data according to changes in primary repositories (e.g. new submissions to INSDC) while maintaining quality and reliability. These kinds of secondary databases have already become so large that most of their resources are invested in system maintenance and the development of tools to increase usability and analysis capability; hence, manually supervised tasks are increasingly devolved to independent dedicated teams.

Tertiary resources have narrowed their focus and reduced the size and complexity of their databases. Here, the largest investment is given to manual supervision tasks performed by expert curators in order to complete or correct information gathered from primary and secondary databases. The curation of rRNA gene sequence data (e.g. the LTP), for instance, comprises a search for sequence entries relevant to a specific purpose (e.g. taxonomy → type strains) and the incorporation of added value like curated metadata and sequence alignment. The ultimate goal of tertiary resources is twofold: to provide a reference tool for a given user community and to have an impact on secondary- and primary-level databases which acquire the curated data and thereby improve their service quality (Figure 2).

FIGURE 2 Data flow between microbial databases. Three main activities in microbiology are the description of taxa (A); provision of associated sequence data (B); and the storage of strains and information into Biological Resource Centres (C). These tasks constitute the primary-level information (1). Through data selection and curation, specific information becomes available as secondary resources (2). Further integration of higher resources leads to tertiary-level information, with even more specific and optimized databases (3). Arrows indicate main information flows.

2.1.1 Taxonomy (LPSN and Bergey’s Manual)

The existence of an official nomenclature for Archaea and Bacteria is one of the most important achievements in microbiology in recent times. It represents a global agreement for the naming of prokaryotes, with strong implications for scientific communication. According to the International Committee on Systematics of Prokaryotes, the International Journal of Systematic and Evolutionary Microbiology (IJSEM) is the official journal for the publication of validly published archaeal and bacterial names, thereby providing a primary resource for microbial systematics. Up to December 2012, nearly 15,000 names of prokaryotic taxa (of any rank) had been published; this number has grown at a constant rate of about 750 names per year since 2006 (Figure 1).

In 1997, the List of Prokaryotic Names with Standing in Nomenclature (LPSN; http://www.bacterio.net/; Euzéby, 1997) was created to gather the past and present nomenclature of each published prokaryotic taxon in a single Web resource. LPSN became a highly respected and specialized secondary resource which substantially improves access to taxonomic information. The LPSN provides information on the latest valid nomenclature for each taxon, the nomenclatural type and its taxonomic classification, related publications and taxonomic opinions. Dr. Aidan Parte is the current curator responsible for the LPSN (Parte, 2014).

Although classification of Archaea and Bacteria is not officially regulated, the inference of genealogical relationships based on the ‘molecular clock’ concept, particularly the gene of the small subunit of ribosomal RNA (SSU or 16S rRNA), provided the key for the classification of prokaryotes based on natural relationships (Amann, Ludwig, & Schleifer, 1995; Fox, Pechman, & Woese, 1977; Ludwig & Schleifer, 1994). In 2001, the second edition of Bergey’s Manual of Systematic Bacteriology gave the phylogenetic backbone of the prokaryotes, by providing an updated and emended framework for prokaryotic classification based on rRNA (Garrity & Holt, 2001). Bergey’s Taxonomic Outlines have been subsequently updated to include newly published species and additional sequence data (Ludwig et al., 2012; Ludwig, Euzéby, & Whitman, 2010; Ludwig, Schleifer, & Whitman, 2009).

2.1.2 Type-strain information (StrainInfo database)

BRCs act as long-term reservoirs for cultivable microorganisms (e.g. DSMZ – Deutsche Sammlung von Mikroorganismen und Zellkulturen). They deal with the authentication, safe preservation and supply of deposited cell material and operate according to international laws regarding health and safety requirements, quarantine regulations, intellectual property rights and classification of microorganisms into hazard groups (see www.wfcc.info, for more information).

The name of each archaeal and bacterial species has to be validly published to conform to the Bacteriological Code of Nomenclature (Lapage et al., 1992) and has to be represented by a nomenclatural type, that is, a viable and culturable strain to which the name is permanently linked. This is the reason why the circumscription of a new species needs to be based on a careful comparison between the newly isolated strains and the type strains of genealogically related species. In order to allow exploration and unlimited access to the taxon’s phenotypic and genotypic characteristics, it is mandatory that type strains be deposited in two internationally recognized service culture collections, in two different countries (Tindall, Kämpfer, Euzéby, & Oren, 2006).

Finding information (e.g. sequence data) for a particular type strain can be hampered for two main reasons: (I) different stocks of the same type strain held in different collections are cited differently (e.g. ATCC 9001, CECT 515, DSM 30083 for the Escherichia coli type strain), leading to syntactical variation, synonymy and homonymy between culture identifiers and (II) type strains coexist within a catalogue of nearly 1 million Archaea and Bacteria available in 600 BRCs worldwide (June 2013; http://www.wfcc.info/ccinfo/statistics/). In 2005, the StrainInfo database was created with the aim of integrating information on all strains held in BRCs into a single online catalogue (Dawyndt, Vancanneyt, De Meyer, & Swings, 2005) with the public sequence entries available for those cultures. As a secondary information resource, the main features of StrainInfo are the capability to automatically resolve equivalent strain identifiers and to link them with external taxonomic resources of sequence data and publications, thereby allowing efficient integrated studies of type strains.
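The identifier problem described above can be illustrated with a minimal Python sketch. It is not StrainInfo's implementation: the `normalize`, `build_index` and `same_strain` helpers and the equivalence table are hypothetical, and only the three Escherichia coli identifiers come from the text. The sketch assumes that identifiers consist of a collection acronym followed by a number.

```python
import re

# Hypothetical equivalence table. Only the E. coli type-strain identifiers
# below are taken from the text; a real catalogue would hold many groups.
EQUIVALENT_IDS = [
    {"ATCC 9001", "CECT 515", "DSM 30083"},
]


def normalize(identifier: str) -> str:
    """Collapse syntactic variants, e.g. 'dsm-30083' becomes 'DSM 30083'."""
    match = re.fullmatch(r"\s*([A-Za-z]+)[\s\-:]*(\d+)\s*", identifier)
    if match is None:
        # Assumption: anything not shaped like 'ACRONYM NUMBER' is kept as-is.
        return identifier.strip().upper()
    return f"{match.group(1).upper()} {match.group(2)}"


def build_index(groups):
    """Map every normalized identifier to one canonical identifier per strain."""
    index = {}
    for group in groups:
        canonical = min(group)  # arbitrary but deterministic choice
        for ident in group:
            index[normalize(ident)] = canonical
    return index


def same_strain(a: str, b: str, index) -> bool:
    """True only if both identifiers are known and resolve to the same strain."""
    key_a = index.get(normalize(a))
    key_b = index.get(normalize(b))
    return key_a is not None and key_a == key_b
```

With `index = build_index(EQUIVALENT_IDS)`, the call `same_strain("dsm-30083", "ATCC9001", index)` treats the two differently written identifiers as the same type strain, whereas an identifier missing from the table is never matched.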

2.1.3 Sequences and alignments (ARB and SILVA)

Three independent research groups in Europe, the United States and Australia (SILVA, Pruesse et al., 2007; RDP-II, Cole et al., 2007; greengenes, DeSantis et al., 2006) emerged with the aim to (1) provide the scientific community with updated universal alignments for optimal and comparable phylogenetic reconstructions; (2) produce and maintain curated datasets of nearly full-length rRNA sequences to be used for in-depth phylogenetic analyses; and (3) develop a set of bioinformatic tools for online sequence data management and analyses.

SILVA databases were generated to meet the need for reference, comprehensive, quality checked and regularly updated datasets of aligned 16S/18S (SSU) and 23S/28S (LSU) rRNA gene sequences of Archaea, Bacteria and Eukarya (Quast et al., 2013). SILVA maintains the universal rRNA alignments for the ARB software package (Ludwig et al., 2004), one of the most comprehensive tools for phylogenetic analysis. These alignments have been manually revised taking into account the functional and evolutionary constraints given by the molecule’s secondary structure (helix- and stem-loop structures) (Ludwig & Schleifer, 1994), while maximizing the positional orthology (i.e. needed to obtain reliable and comparable phylogenies) (Peplies, Kottmann, Ludwig, & Glöckner, 2008; Schloss, 2010).

Two SILVA databases (PARC and REF) are available with distinct quality standards. Critical quality parameters include sequence length, ambiguities, homopolymers, chimaera probability and alignment quality criteria. In the SSU PARC database (3.8 million entries in SILVA release 115), all sequences with lengths above 300 bp are retained, whereas in SSU REF databases (1.5 million entries) the cutoff is 900 bp for archaeal and 1200 bp for bacterial sequences. In addition, sequence entries in SILVA are enriched with additional metadata obtained from other resources, for instance, strain information (EMBL, LTP, StrainInfo), taxonomic classification (EMBL, greengenes, RDP-II, LTP) and curated habitat descriptors (megx.net) (see Quast et al., 2013, for more details). The complete SSU rRNA gene sequence dataset has grown at a nearly exponential rate since the early 1990s, compared to the arithmetic growth of species descriptions (Figure 1).
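As an illustration of the length cutoffs quoted above, the following Python sketch assigns a sequence to the PARC and REF datasets by length alone. It is a simplification, not SILVA's pipeline: the function name is hypothetical, the other quality parameters (ambiguities, homopolymers, chimaera probability, alignment quality) are ignored, and treating the REF cutoffs as inclusive is an assumption. No REF cutoff is stated in the text for Eukarya, so none is applied.

```python
# Length cutoffs quoted in the text for SSU sequences (in base pairs).
PARC_MIN_LENGTH = 300                                # "lengths above 300 bp"
REF_MIN_LENGTH = {"Archaea": 900, "Bacteria": 1200}  # REF cutoffs per domain


def silva_ssu_datasets(domain: str, length: int) -> set:
    """Return the SSU datasets a sequence would enter on length grounds only.

    Assumptions: the PARC cutoff is exclusive, the REF cutoffs are inclusive,
    and domains without a stated REF cutoff never enter REF.
    """
    datasets = set()
    if length > PARC_MIN_LENGTH:
        datasets.add("PARC")
    ref_cutoff = REF_MIN_LENGTH.get(domain)
    if ref_cutoff is not None and length >= ref_cutoff:
        datasets.add("REF")
    return datasets
```

Under these assumptions a 1000 bp archaeal sequence qualifies for both datasets, while a 1000 bp bacterial sequence qualifies for PARC only.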

3 DATABASE CREATION AND UPDATING

The LTP started as a manually supervised process designed to merge existing sources of information into a new curated type-strain database (Yarza et al., 2008):

1. From LPSN, a non-redundant list of names was created to represent all the hitherto described species and subspecies of Archaea and Bacteria. In the process, later heterotypic synonyms (e.g. Streptomyces parvisporogenes), Cyanobacteria not validly published according to the Bacteriological Code (e.g. all species within the genus Anabaena) and the Candidatus category (e.g. Candidatus Baumannia cicadellinicola) were not included. From the SILVA database, a dataset of candidate type-strain sequences was selected. SILVA entries had already been mapped to the StrainInfo database, which distinguishes type from non-type strains. All 16S rRNA gene sequence entries from the SILVA REF database tagged as type (T) or cultured (C) were selected as candidates in the preliminary dataset. Then, these sequences were manually assigned to validly published names of species or subspecies (LPSN) by means of the manual verification of type-strain identifiers. This task was hampered by the abundance of outdated sequence-associated information like the species name and/or the strain identifier. These sequence-associated metadata fields were scarcely supervised at the respective INSDC database, thus justifying the manual supervision of the complete procedure. StrainInfo was often consulted to learn updated equivalences between culture collections.

2. Several hundred type strains were missing in the initial dataset and their sequences had to be manually sought in other resources (Bergey’s Manual, EMBL, IJSEM). Indeed, even after such searches, the complete set of type strains was not fully represented. The type strains of some of these species had never been sequenced; others were represented by low-quality sequences. The species without good-quality sequences for their type strain were dubbed ‘orphan’ species.

3. At the pre-final stage, the sequence dataset under consideration was redundant because some type strains had either been repeatedly sequenced or their sequenced genomes included multiple copies of the ribosomal operon. This redundancy is not required for classification and identification purposes. Subsequently, the LTP team decided to retain only one sequence per type strain in the final dataset, namely, the one considered to be the best in terms of length, ambiguities, homopolymers, chimaera probability and alignment quality. However, in cases of doubt, the earliest submission to an INSDC partner was chosen. The ‘one sequence per type strain’ standard was set aside only for a small minority of genomes with highly divergent (<98% sequence identity) copies of the rRNA operon (e.g. Haloarcula marismortui DSM 3752).

The LTP database is periodically updated (Table 1) to account for new taxa and other taxonomic changes that are published monthly in the IJSEM (Munoz et al., 2011). All new sequences belong to the type strains of the species they are representing. This information is initially taken from LPSN and complemented with IJSEM and StrainInfo resources. While sequencing and submission of 23S rRNA sequence data still occasionally occur, the descriptions of almost all new species of Archaea and Bacteria include the accession number of a public 16S rRNA gene entry for the type strain, which can be downloaded from an INSDC database. All new entries included in the database are manually improved in terms of alignment and metadata as described above.

In summary, the LTP database is updated according to the following criteria: (1) new names are given to an existing type strain when new combinations or earlier homotypic synonyms are published; (2) new sequence entries are used to represent existing type strains when they have been proven to be of better quality; (3) new type strains or neotypes and their respective sequence entries are added to the database in the case of new species and subspecies descriptions; and (4) entries are deleted from the database when their names should not be used to reference any type strain, such as later heterotypic synonyms or rejected names.
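The ‘one sequence per type strain’ rule from step 3 can be sketched in Python as follows. This is an illustrative assumption about how the named quality parameters might be combined, not the curators' actual procedure, which involved manual judgement: the `Candidate` record, its field names and the strict lexicographic ordering of the criteria are all hypothetical. The only rules taken from the text are the list of parameters and the fallback to the earliest submission.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate sequence of a type strain."""
    accession: str
    type_strain: str
    length: int               # longer is better
    ambiguities: float        # lower is better
    homopolymers: float       # lower is better
    chimera_score: float      # lower is better
    alignment_quality: float  # higher is better
    submitted: date           # earliest submission wins ties


def quality_key(c: Candidate):
    """Sort key where smaller means better (assumed lexicographic order)."""
    return (-c.length, c.ambiguities, c.homopolymers,
            c.chimera_score, -c.alignment_quality, c.submitted)


def pick_representatives(candidates):
    """Keep one candidate per type strain: the one with the smallest key."""
    best = {}
    for c in candidates:
        current = best.get(c.type_strain)
        if current is None or quality_key(c) < quality_key(current):
            best[c.type_strain] = c
    return best
```

The returned dictionary maps each type-strain identifier to its single retained candidate. The exception mentioned in the text for genomes with highly divergent operon copies is not modelled here.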

4 FEATURES OF THE DATABASE

4.1 OPTIMIZED SSU AND LSU ALIGNMENTS

Although all sequences in the LTP originally come from the SILVA database, they undergo an extra manual supervision of the alignment, further resolving positions that had previously been misaligned. The SSU alignment is of better quality than the LSU alignment, given the ample experience with a much larger dataset. In this regard, LTP release 102 involved a major improvement of SILVA’s LSUseed alignment (i.e. a reference core of trusted sequences for database creation), including a manual inspection of 5000 new sequences which were appended to the 2800 existing in the seed. Moreover, the alignment was extended to a final size of 100,000 positions to better accommodate insertions contributed by different taxa (Yarza et al., 2010).

4.2 CURATED HIERARCHICAL CLASSIFICATION

All sequences in the LTP are complemented with the hierarchic classification (from genus to phylum) available at LPSN, which includes merged information from the NCBI taxonomy, Taxonomic Outlines of Bergey’s Manual, TOBA and suggestions made by the LTP. The full classification is available in the database fields hi_tax_ltp and tax_ltp (Table 2). As a complement, information for nomenclatural types (i.e. type species of genera and higher ranks) has been retrieved from LPSN. All sequences belonging to nomenclatural types are labelled as such in another specific field type_ltp (Table 2).

In addition, the correct species name (fullname_ltp, Table 2) according to LPSN is given to each sequence to replace the ‘organism name’ information which has appeared mistakenly in more than 50% of newly deposited type-strain 16S rRNA gene sequences (Table 1). These inconsistencies in primary data reflect a delay between sequence submission and species publication (Yarza et al., 2008). In practice, a new isolate can be named differently until it is published according to the rules of the Bacteriological Code (Lapage et al., 1992) in one of the journals publishing new taxa. The final name, however, stands only after its publication in IJSEM, either in an article or a validation list. For example, the nucleotide entry HE613447 submitted in 2011 reads ‘Achromobacter sp. R-46660’, when it really refers to the species name Achromobacter spiritinus effectively published in 2013. Wrong data stored in primary resources unavoidably spread to the whole network of databases (Figure 2). To improve this situation, authors are encouraged to update information even after the original submissions. In addition, scientific journals might play a role by requiring an update of taxonomic metadata of, at least, the sequences of type strains.

Table 1 Summary of LTP Releases

Release  Type  Date       Total Entries  New Entries  Deleted Entries  Net Increase  % Incorrect  IJSEMSync  SILVASync  EMBLSync
LTPs93   SSU   Aug. 2008  6728           6728         0                6728          22           Dec. 2007  Feb. 2008  Dec. 2007
LTPs95   SSU   Oct. 2008  7006           299          21               278           45           Jun. 2008  Jul. 2008  Jun. 2008
LTPs100  SSU   Sep. 2009  7710           750          46               704           50           Aug. 2009  Aug. 2009  Jun. 2009
LTPs102  SSU   Sep. 2010  8029           363          44               319           58           Feb. 2010  Feb. 2010  Nov. 2009
LTPs102  LSU   Sep. 2010  792            792          0                792           6            Feb. 2010  Feb. 2010  Nov. 2009
LTPs104  SSU   Mar. 2011  8545           545          29               516           74           Dec. 2010  Oct. 2010  May 2010
LTPs106  SSU   Aug. 2011  8815           279          9                270           77           May 2011   Apr. 2011  Dec. 2010
LTPs108  SSU   Jul. 2012  9279           490          26               464           60           Dec. 2011  Sep. 2011  Jun. 2011
LTPs111  SSU   Feb. 2013  9701           422          7                415           73           Aug. 2012  Jul. 2012  Mar. 2012

‘Sync’ fields correspond to the IJSEM, SILVA and EMBL dates with which each release is synchronized. ‘Net Increase’ of a release is the number of new entries minus the number of deleted entries. ‘% Incorrect’ refers to the percentage of new entries whose INSDC records had incorrect information in the organism name field.

Table 2 Description of LTP Specific Fields

Field Name     Description
fullname_ltp   Corrected species name according to LPSN (http://www.bacterio.net)
rel_ltp        Name of the LTP release where a sequence entry appears for the first time
hi_tax_ltp     Name of the family where the taxon is classified or, for unclassified genera, the name of the next available high taxon above genus (e.g. ‘Unclassified Clostridiales’ for Blautia stercoris; Park, Kim, Roh, & Bae, 2012)
type_ltp       Type species receive the label ‘type sp.’ in this field
riskgroup_ltp  Risk-group classification of microorganisms according to the Federal Institute for Occupational Safety and Health (BAuA) in Germany
tax_ltp        Taxonomic classification into high-taxonomic ranks according to LPSN
url_lpsn_ltp   URL information to access LPSN’s species file
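The definition of ‘Net Increase’ given with Table 1 (new entries minus deleted entries) can be checked against the tabulated values with a few lines of Python. The tuples below are copied from Table 1; the variable and function names are illustrative only.

```python
# (release, type, new entries, deleted entries, reported net increase),
# copied from Table 1.
RELEASES = [
    ("LTPs93", "SSU", 6728, 0, 6728),
    ("LTPs95", "SSU", 299, 21, 278),
    ("LTPs100", "SSU", 750, 46, 704),
    ("LTPs102", "SSU", 363, 44, 319),
    ("LTPs102", "LSU", 792, 0, 792),
    ("LTPs104", "SSU", 545, 29, 516),
    ("LTPs106", "SSU", 279, 9, 270),
    ("LTPs108", "SSU", 490, 26, 464),
    ("LTPs111", "SSU", 422, 7, 415),
]


def net_increase(new_entries: int, deleted_entries: int) -> int:
    """Net increase of a release as defined in the note to Table 1."""
    return new_entries - deleted_entries


# Collect every row whose reported value disagrees with the definition.
mismatches = [
    (release, kind)
    for release, kind, new, deleted, reported in RELEASES
    if net_increase(new, deleted) != reported
]
```

An empty `mismatches` list means that every reported ‘Net Increase’ agrees with the stated definition.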

4.3 RISK-GROUP CLASSIFICATION

Microbial species are classified by the Federal Institute for Occupational Safety and Health (BAuA) in Germany, according to the risk they present to humans, animals and plants. This information is regularly updated and made public in the Technische Regeln für Biologische Arbeitsstoffe (TRBA) document ‘Einstufung von Bakterien in Risikogruppen’ (TRBA 466). These data are initially implemented by the DSMZ, which serves as a source for mapping with the field riskgroup_ltp into the LTP database (Table 2).


4.4 TAXONOMIC THRESHOLDS

The LTP dataset is an analytic tool which allows us to understand the meaning of numerical taxa boundaries based on sequence identity. The lowest identity found within each taxon of every rank is based on a distance matrix calculated with the LTP alignment. Considering all taxa at the levels of genus, family and phylum, a general lower-cutoff value for each rank was obtained. In general, 16S rRNA gene sequence identities lower than 94.9% ± 0.4, 87.5% ± 1.3 and 78.4% ± 2.0 lead to the circumscription of new genera, families and phyla, respectively (Yarza et al., 2008). For 23S rRNA gene sequences, these cutoffs differ slightly: 93.2% ± 1.3 (genus), 87.7% ± 2.5 (family) and 75.3% (phylum) (Yarza et al., 2010). The low errors observed above (i.e. 95% confidence interval of the mean) indicate that taxonomists have historically circumscribed taxa in a coherent way independently of the taxonomic group under study. Therefore, the application of numerical thresholds can be a complementary aid when describing new taxa or revising existing ones. Likewise, the use of these taxa boundaries may be useful for prospective studies (i.e. OTU abundances) on SSU and LSU environmental datasets.
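A minimal Python sketch of how the 16S rRNA cutoffs above might be applied as a complementary aid is given below. The function name is hypothetical, the input is assumed to be the percent identity between a query sequence and its closest type-strain sequence, and the confidence intervals are ignored for simplicity.

```python
# Lower-cutoff identities (percent) for 16S rRNA gene sequences, as quoted
# in the text, ordered from the highest rank to the lowest.
SSU_THRESHOLDS = [("phylum", 78.4), ("family", 87.5), ("genus", 94.9)]


def highest_novel_rank(identity_percent: float):
    """Return the highest rank whose cutoff the identity falls below.

    Returns None when the identity is at or above the genus cutoff, i.e.
    when the thresholds alone give no hint of a new genus or higher taxon.
    Assumption: the confidence intervals around the cutoffs are ignored.
    """
    for rank, cutoff in SSU_THRESHOLDS:
        if identity_percent < cutoff:
            return rank
    return None
```

For example, an identity of 90.0% falls below the genus cutoff but not the family cutoff, so the function returns "genus". Such a result is only a numerical hint and does not replace a full taxonomic description.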

5 PHYLOGENETIC TREES

The SSU-based phylogenetic tree offered by the LTP is calculated using the complete dataset of type strains plus an additional selection of 3000 supporting sequences to stabilize certain under-represented groups (e.g. Cyanobacteria). The algorithm of choice is maximum likelihood implemented in the RAxML program (Stamatakis, 2006), as it is a robust method which, in addition, produces similar topologies to the neighbour-joining and maximum parsimony reconstructions (Yarza et al., 2008). To reduce noise caused by poorly resolved areas of alignments (i.e. due to hypervariability and sequencing errors), a 40% maximum frequency filter was applied and 1390 total alignment positions were considered. Calculation is performed with the GTRGAMMA model (Stamatakis, 2006) and 100 bootstrap replicates. In some releases (e.g. LTP release 111), newly added sequences are incorporated into the existing tree using the ARB parsimony tool with the option for keeping the initial topology while inserting additional sequences. Every 1–2 years, a new full reconstruction is calculated from scratch using the methodology explained above.

The phylogenetic tree calculated for the LSU dataset followed different guidelines due to the extreme shortage of taxa in many groups (Yarza et al., 2008). Initially, a highly stringent filtering approach enabled 2000 high quality (and non-redundant) LSU sequences (type and non-type strain) from SILVA to stand in the initial core dataset. This set of sequences was subject to a maximum likelihood reconstruction in combination with a 50% maximum frequency filter, allowing 2463 positions of the entire alignment. The missing partial or lower quality type-strain sequences were added to the existing tree using the ARB parsimony tool.
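The maximum frequency filters mentioned above (40% for SSU, 50% for LSU) can be illustrated with a small Python sketch. It rests on an assumption about what the filter does: a column is kept when its most frequent residue occurs in at least the given fraction of all sequences, with gap characters not counted as residues. The function below is hypothetical and is not the ARB implementation, whose exact handling of gaps and ambiguity codes may differ.

```python
from collections import Counter

GAP_CHARS = set("-.")  # characters treated as gaps (assumption)


def max_frequency_filter(alignment, min_fraction=0.4):
    """Return indices of alignment columns that pass the frequency filter.

    A column passes when its most common non-gap residue is present in at
    least `min_fraction` of all sequences. All sequences must be aligned,
    i.e. have the same length.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    kept = []
    for col in range(width):
        residues = [seq[col].upper() for seq in alignment
                    if seq[col] not in GAP_CHARS]
        if not residues:
            continue  # column made of gaps only
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / len(alignment) >= min_fraction:
            kept.append(col)
    return kept
```

Only the retained columns would then be passed to the tree reconstruction, which is how highly variable positions are excluded from the calculation.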


The resulting tree topologies are carefully examined to evaluate the monophyly assumption of every taxon. Clades are identified and named when the nomenclatural type is present (e.g. type species of genera, type genera of families). A comparison against many other taxon-specific and broad-range trees, using different sequence datasets and methods, has supported the main genealogical patterns inferred by the LTP (Yarza et al., 2010). LTP’s SSU and LSU phylogenies provide a way to readily examine classifications all along the current taxonomy. The SSU tree highlights a low significance with respect to the relative branching order of phyla, including some considerable ‘jumps’ (Acidobacteria, Cyanobacteria, Fusobacteria), as indicated by relatively short branches. In contrast, some taxa, such as the phylum Bacteroidetes–Chlorobi and the phylum Chlamydia–Verrucomicrobia–Lentisphaerae confirm well-supported higher order structures. In addition, it is interesting to see that the Deltaproteobacteria and Epsilonproteobacteria are separated from the rest of the Proteobacteria, resembling the weak union of the two sister classes (Bacilli and Clostridia) from the Firmicutes (Yarza et al., 2010). Although there is a tendency to show higher phylogenetic coherence at lower ranks, there are several well-known paraphyletic groups, like the Bacillaceae, Clostridiaceae and Pseudomonadaceae, which still require further taxonomic revision. Additionally, the LTP trees contribute new evidence for completing the classification of certain misclassified species, for example, Spongiispira norvegica (Kaesler et al., 2008) which was originally described as a member of the order Oceanospirillales but not associated with a family. However, current data show a SSU sequence similarity of 97% against Oceaniserpentilla haliotis and a clear affiliation with the family Oceanospirillaceae (LTP release 102; Yarza et al., 2010) (Figure 3).

FIGURE 3 Phylogenetic position of Spongiispira norvegica Gp_4_7.1T based on 16S rRNA gene sequences. Tree topology extracted from LTP release 111. Scale bar indicates estimated sequence divergence. Spongiispira norvegica was described as a member of the order Oceanospirillales without a family affiliation being given. However, it shows an SSU sequence similarity of 97% against Oceaniserpentilla haliotis and a clear affiliation with the family Oceanospirillaceae.

6 LTP AS A TAXONOMIC TOOL

The All-Species LTP is not an attempt to reconstruct the species phylogeny with total confidence, but is designed to provide the scientific community with a curated taxonomic tool. As such, the LTP is a collection of reference material (Table 3) that is publicly available and regularly updated. The package includes (i) the sequence database of SSU and LSU sequences from archaeal and bacterial type strains complemented with curated metadata in ARB and CSV formats, both including the LTP-specific fields (Table 2); (ii) the complete dataset of aligned type-strain sequences in FASTA format; (iii) a single phylogenetic tree containing all archaeal and bacterial species; and (iv) a set of descriptive tables and a list of changes (see Table 3 for details). The regular corrections performed by the LTP on rRNA alignments and metadata contribute to the SILVA platform, thereby improving its quality over time (Quast et al., 2013). This transfer of information exemplifies the network’s success in storing and providing reliable microbial information (Figure 2).

Some examples of LTP usage in research projects include facilitating the collection of reference sequences for the reconstruction of taxa genealogies (e.g. Cousin et al., 2012), performing fast and reliable taxonomic affiliations in rRNA surveys (e.g. Santamaria et al., 2012) and serving as reference datasets for testing bioinformatic procedures (e.g. Mizrahi-Man, Davenport, & Gilad, 2013). In addition, a contact address ([email protected]) enables the exchange of suggestions or particular requests by the user community. The LTP should continue to speed up release production by improving data retrieval from secondary resources (LPSN, StrainInfo and SILVA) and by optimizing post-production analysis. Clear moves towards standardization, automation and quality management have been initiated, resulting in better digital communication with the other microbial information resources.

Table 3 Suite of Downloadable Materials

Release information
– Description of new release
– Description of LTP fields
– Table with INSDC entries with incorrect organism name information
– Table with list of changes from last release (updated, deleted, new)

Database exports
– ARB format including all aligned sequences, metadata and trees
– CSV export including metadata only

Tree exports
– Full expanded tree in PDF
– Full expanded tree in NEWICK
– Collapsed overview in PDF

Sequence exports
– All aligned type-strain sequences in FASTA
– All unaligned type-strain sequences in FASTA

Scripts
– ARB filter used to export LTP sequences in FASTA format

LTP release 111 (http://www.arb-silva.de/projects/living-tree).
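As an illustration of how an aligned type-strain FASTA export might support a quick taxonomic affiliation, the following sketch compares an aligned query against reference sequences and reports the closest one by percent identity. It is a simplified stand-in, not the method used by the LTP or by the cited studies. The record names, the toy sequences and the rule of skipping any column with a gap are assumptions made for the example, and the query is assumed to be already aligned to the same columns as the references.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


GAP_CHARS = "-."  # assumed gap symbols in the aligned export


def percent_identity(seq_a, seq_b):
    """Percent identity over columns where both sequences have a base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # ignore columns with a gap in either sequence
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0


def best_match(query, references):
    """Return (identity, name) of the most similar reference sequence."""
    return max((percent_identity(query, seq), name)
               for name, seq in references.items())


# Hypothetical toy alignment standing in for an aligned FASTA export.
reference_fasta = """
>type_strain_A
ACGU-ACGUACGUACGUACG
>type_strain_B
ACGUUACGAACG-AGGUUCA
"""
references = parse_fasta(reference_fasta)
query = "ACGU-ACGUACGUACGUUCG"

score, name = best_match(query, references)
print(f"closest reference: {name} ({score:.1f}% identity)")
```

Real surveys would normally add an alignment step for the query and a similarity threshold before assigning a name, since the closest reference is not necessarily the same species.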


ACKNOWLEDGEMENTS

We acknowledge contributions from the LTP project leader, Ramon Rosselló-Móra, and the scientific support of Rudolf Amann, Jean Euzéby, Wolfgang Ludwig, Karl-Heinz Schleifer, Frank Oliver Glöckner, Michael Richter and Jörg Peplies. We are also grateful for the feedback received from LTP users. Funding was provided by the Max Planck Society, Elsevier and the EU project SYMBIOMICS (grant EU-264774).

REFERENCES

Amann, R. I., Ludwig, W., & Schleifer, K. H. (1995). Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiological Reviews, 59, 143–169.
Cole, J. R., Chai, B., Farris, R. J., Wang, Q., Syed-Mohideen, A. S. K., McGarrell, D. M., et al. (2007). The ribosomal database project (RDP-II): Introducing myRDP space and quality controlled public data. Nucleic Acids Research, 35(Database issue), D169–D172.
Cousin, S., Gulat-Okalla, M.-L., Motreff, L., Gouyette, C., Bouchier, C., Clermont, D., et al. (2012). Lactobacillus gigeriorum sp. nov., isolated from chicken crop. International Journal of Systematic and Evolutionary Microbiology, 62, 330–334.
Dawyndt, P., Vancanneyt, M., De Meyer, H., & Swings, J. (2005). Knowledge accumulation and resolution of data inconsistencies during the integration of microbial information sources. IEEE Transactions on Knowledge and Data Engineering, 17, 1111–1126.
DeSantis, T. Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E. L., Keller, K., et al. (2006). Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and Environmental Microbiology, 72, 5069–5072.
Euzéby, J. P. (1997). List of bacterial names with standing in nomenclature: A folder available on the Internet. International Journal of Systematic Bacteriology, 47, 590–592.
Fox, G. E., Pechman, K. R., & Woese, C. R. (1977). Comparative cataloguing of 16S ribosomal ribonucleic acid: Molecular approach to prokaryotic systematics. International Journal of Systematic Bacteriology, 27, 44–57.
Garrity, G. M., & Holt, J. G. (2001). The road map to the manual. In D. R. Boone, R. W. Castenholz, G. M. Garrity, et al. (Eds.), The Archaea and the deeply branching and phototrophic bacteria: Vol. 1. Bergey’s manual of systematic bacteriology. New York: Springer.
Kaesler, I., Graeber, I., Borchert, M. S., Pape, T., Dieckmann, R., von Döhren, H., et al. (2008). Spongiispira norvegica gen. nov., sp. nov., a marine bacterium isolated from the boreal sponge Isops phlegraei. International Journal of Systematic and Evolutionary Microbiology, 58, 1815–1820.
Klenk, H.-P., & Göker, M. (2010). En route to a genome-based classification of Archaea and Bacteria? Systematic and Applied Microbiology, 33, 175–182.
Lapage, S. P., Sneath, P. H. A., Lessel, E. F., Skerman, V. B. D., Seeliger, H. P. R., & Clark, W. A. (1992). International code of nomenclature of bacteria (1990 revision). Washington: Bacteriological Code, American Society for Microbiology.
Ludwig, W., Euzéby, J., Schumann, P., Busse, H.-J., Trujillo, M. E., Kämpfer, P., et al. (2012). Road map of the phylum Actinobacteria. In M. Goodfellow, P. Kämpfer, H.-J. Busse, M. E. Trujillo, K.-i. Suzuki, W. Ludwig, W. B. Whitman, et al. (Eds.), The Actinobacteria, part A: Vol. 5. Bergey’s manual of systematic bacteriology (pp. 1–28). New York: Springer.
Ludwig, W., Euzéby, J., & Whitman, W. B. (2010). Road map of the phyla Bacteroidetes, Spirochaetes, Tenericutes (Mollicutes), Acidobacteria, Fibrobacteres, Fusobacteria, Dictyoglomi, Gemmatimonadetes, Lentisphaerae, Verrucomicrobia, Chlamydiae, and Planctomycetes. In N. R. Krieg, J. T. Staley, D. Brown, B. P. Hedlund, B. J. Paster, N. Ward, W. Ludwig, & W. B. Whitman (Eds.), Bergey’s manual of systematic bacteriology: Vol. 4 (2nd ed., pp. 1–19). New York: Springer.
Ludwig, W., Glöckner, F. O., & Yilmaz, P. (2011). The use of rRNA gene sequence data in the classification and identification of prokaryotes. In F. Rainey & A. Oren (Eds.), Taxonomy of prokaryotes: Vol. 38. Methods in microbiology (pp. 349–384). Amsterdam: Academic Press.
Ludwig, W., & Schleifer, K. H. (1994). Bacterial phylogeny based on 16S and 23S rRNA sequence analysis. FEMS Microbiology Reviews, 15, 155–173.
Ludwig, W., Schleifer, K.-H., & Whitman, W. B. (2009). Revised road map to the phylum Firmicutes. In P. De Vos, G. M. Garrity, D. Jones, N. R. Krieg, W. Ludwig, F. A. Rainey, K.-H. Schleifer, W. B. Whitman, et al. (Eds.), Vol. 3. Bergey’s manual of systematic bacteriology (pp. 1–13). New York: Springer.
Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Kumar, Y., et al. (2004). ARB: A software environment for sequence data. Nucleic Acids Research, 32, 1363–1371.
Mizrahi-Man, O., Davenport, E. R., & Gilad, Y. (2013). Taxonomic classification of bacterial 16S rRNA genes using short sequencing reads: Evaluation of effective study designs. PLoS One, 8, e53608.
Munoz, R., Yarza, P., Ludwig, W., Euzéby, J., Amann, R., Schleifer, K.-H., et al. (2011). Release LTPs104 of the All-Species Living Tree. Systematic and Applied Microbiology, 34, 169–170.
Park, S. K., Kim, M. S., Roh, S. W., & Bae, J. W. (2012). Blautia stercoris sp. nov., isolated from human faeces. International Journal of Systematic and Evolutionary Microbiology, 62, 776–779.
Parte, A. C. (2014). LPSN-list of prokaryotic names with standing in nomenclature. Nucleic Acids Research, 42, D613–D616.
Peplies, J., Kottmann, R., Ludwig, W., & Glöckner, F. O. (2008). A standard operating procedure for phylogenetic inference (SOPPI) using (rRNA) marker genes. Systematic and Applied Microbiology, 31, 251–257.
Pruesse, E., Quast, C., Knittel, K., Fuchs, B. M., Ludwig, W., Peplies, J., et al. (2007). SILVA: A comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Research, 35, 7188–7196.
Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P., et al. (2013). The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Research, 41(Database issue), D590–D596.
Richter, M., & Rosselló-Móra, R. (2009). Shifting the genomic gold standard for the prokaryotic species definition. Proceedings of the National Academy of Sciences of the United States of America, 106, 19126–19131.
Santamaria, M., Fosso, B., Consiglio, A., Caro, G. D., Grillo, G., Licciulli, F., et al. (2012). Reference databases for taxonomic assignment in metagenomics. Briefings in Bioinformatics, 13, 682–695.
Schloss, P. D. (2010). The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Computational Biology, 6, e1000844.
Stamatakis, A. (2006). RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22, 2688–2690.
Tindall, B. J., Kämpfer, P., Euzéby, J. P., & Oren, A. (2006). Valid publication of names of prokaryotes according to the rules of nomenclature: Past history and current practice. International Journal of Systematic and Evolutionary Microbiology, 56, 2715–2720.
Yarza, P., Ludwig, W., Euzéby, J., Amann, R., Schleifer, K.-H., Glöckner, F. O., et al. (2010). Update of the All-Species Living Tree Project based on 16S and 23S rRNA sequence analyses. Systematic and Applied Microbiology, 33, 291–299.
Yarza, P., Richter, M., Peplies, J., Euzéby, J., Amann, R., Schleifer, K.-H., et al. (2008). The All-Species Living Tree project: A 16S rRNA-based phylogenetic tree of all sequenced type strains. Systematic and Applied Microbiology, 31, 241–250.
