SeqCode Provides a Path to Name Uncultivated Prokaryotes
Most prokaryotes have never been isolated in pure culture and cannot be named under the International Code of Nomenclature of Prokaryotes (ICNP). However, many studies have described uncultured bacteria and archaea through metagenome-assembled genomes (MAGs) or single-cell amplified genomes (SAGs), often in combination with physiological or ecological data. These studies have greatly expanded our view of prokaryotic life, its metabolic capabilities and its roles in the environment, but communication about these uncultured bacteria and archaea is difficult without formal taxonomic names. The SeqCode provides a way to create scientifically precise taxonomic names for these uncultured prokaryotes by using genome sequences instead of cultures as nomenclatural types.
Nomenclatural types, or types, are elements to which a taxonomic name is attached and are also used to anchor taxonomic names of plants, animals and fungi. These elements can be compared to those of unidentified samples to decide whether or not the samples should have the same name. SeqCode names are compatible with the ICNP and allow creation of a unified taxonomy for cultured and uncultured prokaryotes. The SeqCode also sets standards for metagenome-assembled genomes (MAGs) and single-cell amplified genomes (SAGs) to be used in systematics, which ensures that nomenclatural types are of sufficient quality to unambiguously identify the taxon, create a meaningful taxonomy and support high-quality bioinformatic analyses.
Why Base Names on Genome Sequence?
Taxonomic names are important for science communication, and SeqCode provides the means to name uncultured prokaryotes. Without a generally agreed upon naming system, history has shown that nomenclature becomes imprecise, redundant and confusing. Precise naming is also critical for creation of stable and effective bioinformatic databases and for large-scale analyses common in bioinformatics. Without a single, permanent name for taxa, databases would require frequent curation, which would be increasingly difficult as the information content grows. Thus, a naming system is also important for the application of modern informatics tools.
Genome sequences are particularly well suited as nomenclatural types for prokaryotes. Previously, deposition of type strains in culture collections was necessary for naming new species of prokaryotes. Strains had to be cultivated from environmental samples and usually isolated as pure cultures, and then compared by a large number of mostly phenotypic tests in a process called polyphasic taxonomy to determine if they represented novel species.
When molecular methods were first introduced, it was discovered that these earlier methods were inaccurate, often grouping unrelated organisms together and failing to recognize closely related species. First, 16S rRNA sequences were used to reconstruct the phylogenetic relationships among type strains and guide their taxonomy. In the last decade, the genomes of most type strains have been sequenced, which allowed a more precise delineation of the phylogenetic relationships between species and greatly enhanced the basic knowledge of their properties.
Because the genome sequence is the blueprint for the properties of the entire organism, it unites the enormous literature on the physiology, molecular biology and ecology of prokaryotes with the properties of specific organisms. It also informs us of key lifestyle features, such as bioenergetics, physiology, differentiation, motility and phage infections. Linking names to genome sequences allows important inferences from one scientific discipline to inform the other disciplines. There are practical reasons to use genome sequences as types instead of strains as well. Sequences can be stored on computers and easily shared. Strains must be stored in culture collections, must remain viable for the future and are comparatively difficult to share.
Uncultured Prokaryotes Are Common
Metagenome sequencing has revealed an enormous diversity of prokaryotes, most of which have never been cultured. Sequencing of DNA from many biomes has identified MAGs representing 135 phyla. In contrast, only about 40 phyla are represented by cultures deposited in culture collections and named under the ICNP.
Phyla are the most distantly related lineages commonly used in prokaryotic systematics. A familiar example is the Firmicutes (now also called the Bacillota), which includes genera such as Bacillus, Clostridium and Staphylococcus. Phyla are ancient groups that diverged billions of years ago. As such, members of different phyla vary greatly in their growth properties, cell structures and metabolisms. Since representatives of less than 1/3 of the phyla have been isolated as pure cultures, there is an enormous gap in our understanding of the breadth of prokaryote diversity.
Similarly, although estimates of the total number of prokaryote species vary widely, only a small fraction have been cultured and named under the ICNP. At the current rate of description, it will take over 1,000 years to describe all the prokaryotic species believed to exist. Even in the human gut, which is probably the most thoroughly studied microbiome, many of the strains identified by metagenomic sequencing cannot be classified in known species or genera. In one study, 696 strains of prokaryotes were identified, but only 460 and 321 could be assigned to genera and species described under the ICNP, respectively. Thus, even in a well-studied microbiome closely associated with the human body, a large fraction of the prokaryotes remains unnamed.
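The proportions behind these figures can be worked out directly. The short Python sketch below only restates the counts quoted above (135 versus about 40 phyla; 696 strains, of which 460 and 321 were assigned to named genera and species); the helper name `percent` is introduced here for illustration.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Phyla: about 40 with cultured, ICNP-named representatives out of 135 seen in MAGs.
cultured_phyla_pct = percent(40, 135)

# Human gut example: 696 strains identified by metagenomic sequencing.
genus_assigned_pct = percent(460, 696)
species_assigned_pct = percent(321, 696)
no_named_species_pct = percent(696 - 321, 696)

print(f"Phyla with cultured representatives: {cultured_phyla_pct}%")
print(f"Gut strains assigned to a named genus: {genus_assigned_pct}%")
print(f"Gut strains assigned to a named species: {species_assigned_pct}%")
print(f"Gut strains without a named species: {no_named_species_pct}%")
```

On these counts, roughly 30% of phyla have cultured representatives, and more than half of the gut strains could not be placed in a species named under the ICNP.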
How to Register Names Under the SeqCode
The SeqCode establishes the rules to create and validate names of prokaryotes based on genome sequences. Genome sequences are derived from genome assemblies, which are computed from the raw sequence data to represent the best estimate of the actual genome sequence. For metagenomic studies comprising large-scale sequencing of environmental DNA, the genome assemblies must be >90% complete and <5% contaminated to serve as nomenclatural types. For cultures isolated from the environment, the read coverage of the genome sequence must be >10-fold to serve as a type. In addition, the assembly of the type sequence and the raw data must be available in an International Nucleotide Sequence Database Collaboration (INSDC) database. The raw sequences are required so that, as new assembly methods are developed, the sequence can be improved.
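As an illustration, the thresholds in the paragraph above can be written as a simple pre-submission check. This is a minimal Python sketch under stated assumptions: the `Assembly` record, its field names and the function `type_eligibility_problems` are hypothetical and are not part of the SeqCode Registry or any real tool. Only the numeric limits (>90% complete, <5% contaminated, >10-fold coverage) and the INSDC requirement come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Assembly:
    """Hypothetical record; field names are illustrative, not a SeqCode schema."""
    label: str
    from_metagenome: bool                       # True for metagenomic data, False for an isolated culture
    completeness_pct: Optional[float] = None    # estimated genome completeness
    contamination_pct: Optional[float] = None   # estimated contamination
    read_coverage_fold: Optional[float] = None  # sequencing depth of the assembly
    assembly_in_insdc: bool = False
    raw_data_in_insdc: bool = False


def type_eligibility_problems(a: Assembly) -> List[str]:
    """List the thresholds from the text that the assembly does not meet."""
    problems = []
    if a.from_metagenome:
        # Metagenomic assemblies: >90% complete and <5% contaminated.
        if a.completeness_pct is None or a.completeness_pct <= 90.0:
            problems.append("completeness must be >90%")
        if a.contamination_pct is None or a.contamination_pct >= 5.0:
            problems.append("contamination must be <5%")
    else:
        # Isolated cultures: read coverage >10-fold.
        if a.read_coverage_fold is None or a.read_coverage_fold <= 10.0:
            problems.append("read coverage must be >10-fold")
    # Both the assembly and the raw data must be deposited.
    if not (a.assembly_in_insdc and a.raw_data_in_insdc):
        problems.append("assembly and raw data must both be in an INSDC database")
    return problems


mag = Assembly("example_MAG", from_metagenome=True,
               completeness_pct=96.2, contamination_pct=1.3,
               assembly_in_insdc=True, raw_data_in_insdc=True)
print(type_eligibility_problems(mag))  # [] means no threshold is violated
```

An empty list means only that the values pass the limits quoted here; the curators' review described in the next section is a separate step.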
In a typical investigation, a MAG, SAG or other genome sequence is compared to the databases to determine if close relatives have been previously described. A common criterion is that the new sequence must have an average nucleotide identity (ANI) of <95% to every described species to represent a new species. If no close relatives are found, a decision might be made to name and describe a new species. Preliminary registration can then begin during manuscript preparation to allow SeqCode curators to check the data quality and ensure that the name and its etymology are correctly formed. In this case, the taxonomic name and metadata are entered into the SeqCode Registry. Within the registry, curators check data quality, name synonymy and etymology, leading to provisional acceptance of names that comply with SeqCode rules. This procedure ensures data quality and correct naming and avoids the need for errata after publication. Entry of the publication’s digital object identifier (DOI) into the registry marks the date of validation, which establishes the name’s priority. Because the SeqCode (and the ICNP) require that the earliest name of a taxon be used, the priority establishes the precedence of this name. If there is no other name for the taxon with an earlier date of validation, this name must be used.
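Two of the decisions described above, the ANI criterion and the rule of priority, can be sketched in a few lines of Python. The function names, the dictionary of ANI values and the placeholder taxon names are assumptions made for illustration; in practice ANI values would come from a dedicated comparison tool, and priority is determined within the SeqCode Registry rather than by user code.

```python
from datetime import date
from typing import Dict, List, Tuple

ANI_SPECIES_THRESHOLD = 95.0  # percent; the common criterion cited in the text


def may_be_new_species(ani_to_described: Dict[str, float],
                       threshold: float = ANI_SPECIES_THRESHOLD) -> bool:
    """True if the genome has ANI below the threshold to every described species."""
    return all(ani < threshold for ani in ani_to_described.values())


def name_with_priority(validated: List[Tuple[str, date]]) -> str:
    """Return the name with the earliest date of validation.

    Raises ValueError if the list is empty.
    """
    return min(validated, key=lambda item: item[1])[0]


# Placeholder ANI values (percent) against already described species.
ani_hits = {"Examplea prima": 88.4, "Examplea secunda": 91.7}
print(may_be_new_species(ani_hits))  # True: every value is below 95%

# Two placeholder names proposed for the same taxon, with validation dates.
candidates = [("Examplea tertia", date(2023, 5, 2)),
              ("Examplea quarta", date(2022, 11, 17))]
print(name_with_priority(candidates))  # Examplea quarta
```

The 95% cutoff is described in the text as a common criterion, not a fixed rule, so it is exposed here as a parameter.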
Path 2 is for names that are already published, such as Candidatus names. Candidatus names are provisional names established under the ICNP that are usually based on genome sequences and other data, but they have not been validated because a culture has not been deposited into 2 culture collections. However, if the genome sequences meet the data standards, they can be validated in the SeqCode after being checked by the SeqCode curators.
How the SeqCode Was Developed
The SeqCode was started on Jan. 1, 2022, by the SeqCode Organizing Committee, and a beta version of the online SeqCode Registry is currently operational. It was developed after a series of online meetings and in-person workshops in 2018 and 2019 that resulted in the publication of a statement, which was endorsed by 120 prominent microbiologists from around the globe. The statement included 2 proposals. The first was to allow DNA sequences to serve as nomenclatural types in the ICNP. However, this proposal was rejected by the International Committee on Systematics of Prokaryotes (ICSP), the governing body of the ICNP. The alternative proposal was to prepare the SeqCode, which would allow creation of stable names compatible with the ICNP so that the 2 codes could be merged in the future. The first draft of the SeqCode was completed in the summer of 2020 and discussed at a series of international online workshops held during February 2021 under the banner of the International Society for Microbial Ecology (ISME). These workshops engaged 848 registrants from 42 countries. As a result of this process, the first edition of the SeqCode was published in the summer of 2022.
Join the SeqCode Community
The success of the SeqCode depends on community participation, and those interested in microbial nomenclature are welcome to join. Membership enables voting on amendments to the SeqCode and its statutes, election of officers and participation in discussions on systematics. The statutes for the administration of the SeqCode governing body are also available. In addition to providing a valuable service, the intention is to make naming easy and improve the stability of names. Community guidance is essential to achieve these goals. The SeqCode also encourages integration of Registry data into other databases.