Introduction

The International Nucleotide Sequence Database Collaboration (INSDC) consists of the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank at NCBI. As part of this collaboration, all three organizations accept new sequence submissions and share sequence data with one another. To facilitate this exchange, each member of the collaboration is assigned specific accession prefixes.

Format for GenBank accession numbers

| Type | Format |
| --- | --- |
| Nucleotide | 1 letter + 5 numerals, 2 letters + 6 numerals, or 2 letters + 8 numerals |
| Protein | 3 letters + 5 numerals or 3 letters + 7 numerals |
| WGS | 4 letters + 2 numerals for WGS assembly version + 6 or more numerals, or 6 letters + 2 numerals for WGS assembly version + 7 or more numerals |
| MGA | 5 letters + 7 numerals |
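
The format rules above can be expressed as regular expressions. The following is a minimal sketch, not an official NCBI validator: the names `GENBANK_PATTERNS` and `classify_genbank_accession` are hypothetical, the patterns are transcribed directly from the table, and the function expects a bare accession without a version suffix.

```python
import re

# Patterns transcribed from the format table; all names here are illustrative.
GENBANK_PATTERNS = [
    # 1 letter + 5 numerals, 2 letters + 6 numerals, or 2 letters + 8 numerals
    ("Nucleotide", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8}")),
    # 3 letters + 5 numerals or 3 letters + 7 numerals
    ("Protein", re.compile(r"[A-Z]{3}\d{5}|[A-Z]{3}\d{7}")),
    # 4 letters + 2 version numerals + 6 or more numerals, or
    # 6 letters + 2 version numerals + 7 or more numerals
    ("WGS", re.compile(r"[A-Z]{4}\d{2}\d{6,}|[A-Z]{6}\d{2}\d{7,}")),
    # 5 letters + 7 numerals
    ("MGA", re.compile(r"[A-Z]{5}\d{7}")),
]


def classify_genbank_accession(accession):
    """Return the record type whose format matches, or None if none does."""
    core = accession.strip().upper()
    for record_type, pattern in GENBANK_PATTERNS:
        if pattern.fullmatch(core):
            return record_type
    return None
```

The accession strings used to exercise this sketch should be treated as shape examples only. A RefSeq-style identifier such as `NM_123456` matches none of the patterns, so the function returns `None` for it.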

Primary GenBank accession number prefixes

| Accession prefixes | Data source |
| --- | --- |
| AE, CP, CY | Genome project data |
| U, AF, AY, DQ, EF, EU, FJ, GQ, GU, HM, HQ, JF, JN, JQ, JX, KC, KF, KJ, KM, KP, KR, KT, KU, KX, KY, MF, MG, MH, MK, MN, MT, MW, MZ, OK, OL, OM | Direct submissions |
| AAAA-AZZZ, JAAA-JZZZ, LAAA-LZZZ, MAAA-MZZZ, NAAA-NZZZ, PAAA-PZZZ, QAAA-QZZZ, RAAA-RZZZ, SAAA-SZZZ, VAAA-VZZZ, WAAA-WZZZ, XAAA-XZZZ, AAAAAA-AZZZZZ, JAAAAA-JZZZZZ | Whole genome shotgun sequences |
| AAA-AZZ, QAA-QZZ, UAA-UZZ | Protein ID |
| DAA-DZZ | TPA or TPA WGS Protein ID |
| EAA-EZZ, KAA-KZZ, OAA-OZZ, PAA-PZZ, RAA-RZZ, TAA-TZZ, MAA-MZZ, NAA-NZZ | WGS protein ID |
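
As an illustration, the prefix table above can be turned into a small lookup. This is a sketch under stated assumptions: the function name `genbank_data_source` and the constant names are hypothetical, a range such as AAAA-AZZZ is read as "every four-letter prefix starting with A", and only the prefixes listed in this section are covered, so any other prefix returns `None`.

```python
import re

# Prefixes copied from the table; all names in this sketch are hypothetical.
GENOME_PROJECT = {"AE", "CP", "CY"}
DIRECT_SUBMISSION = set(
    "U AF AY DQ EF EU FJ GQ GU HM HQ JF JN JQ JX KC KF KJ KM KP KR KT "
    "KU KX KY MF MG MH MK MN MT MW MZ OK OL OM".split()
)
# Assumption: a range like AAAA-AZZZ covers every prefix of that length
# beginning with the same first letter.
WGS_FIRST_LETTER_4 = set("AJLMNPQRSVWX")
WGS_FIRST_LETTER_6 = set("AJ")
PROTEIN_FIRST_LETTER = {
    **dict.fromkeys("AQU", "Protein ID"),
    "D": "TPA or TPA WGS Protein ID",
    **dict.fromkeys("EKOPRTMN", "WGS protein ID"),
}


def genbank_data_source(accession):
    """Return the data source for the accession's letter prefix, or None."""
    match = re.match(r"[A-Z]+", accession.strip().upper())
    if match is None:
        return None
    prefix = match.group()
    if prefix in GENOME_PROJECT:
        return "Genome project data"
    if prefix in DIRECT_SUBMISSION:
        return "Direct submissions"
    if len(prefix) == 3:
        return PROTEIN_FIRST_LETTER.get(prefix[0])
    if len(prefix) == 4 and prefix[0] in WGS_FIRST_LETTER_4:
        return "Whole genome shotgun sequences"
    if len(prefix) == 6 and prefix[0] in WGS_FIRST_LETTER_6:
        return "Whole genome shotgun sequences"
    return None
```

The lookup inspects only the leading letters. It does not check that the numerals following the prefix have the right length, so it is best combined with a separate format check.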

Version number suffix

A GenBank sequence version number consists of the record's accession number followed by a dot and a version number (i.e., Accession.Version). The version number increments by one each time the sequence record is updated.
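
A minimal sketch of splitting an Accession.Version identifier into its two parts follows. The function name `split_accession_version` is hypothetical, and returning `None` for a missing version is a choice made for this example, not NCBI behavior.

```python
def split_accession_version(identifier):
    """Split 'Accession.Version' into (accession, version).

    The version is None when the identifier carries no dot suffix.
    """
    text = identifier.strip()
    accession, dot, version = text.rpartition(".")
    if not dot:
        # No dot at all: the whole string is the accession.
        return text, None
    if not accession or not (version.isascii() and version.isdigit()):
        raise ValueError(f"not in Accession.Version form: {identifier!r}")
    return accession, int(version)
```

For example, a placeholder identifier `AF123456.2` splits into the accession `AF123456` and the integer version `2`. An update to that record would then appear as `AF123456.3`.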

Format for RefSeq accession numbers

RefSeq accession numbers do not follow the naming conventions set by INSDC; instead, they have a two-letter prefix followed by an underscore. RefSeq records are classified as "Known RefSeq" (manually reviewed by NCBI staff or collaborators) or "Model RefSeq" (records produced by an automated pipeline).

| Accession prefixes | Type | Description |
| --- | --- | --- |
| NC_ | Known RefSeq | Complete genomic molecule (Reference assembly) |
| NG_ | Known RefSeq | Incomplete genomic region |
| NM_ | Known RefSeq | mRNA |
| NR_ | Known RefSeq | Non-coding RNA |
| NP_ | Known RefSeq | Protein |
| NT_, NW_ | Known RefSeq | Genomic contig or scaffold |
| XM_ | Model RefSeq | mRNA (Predicted model) |
| XR_ | Model RefSeq | Non-coding RNA (Predicted model) |
| XP_ | Model RefSeq | Protein (Predicted model) |
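
The RefSeq prefixes listed above can be looked up with a simple dictionary. The sketch below is illustrative only: `REFSEQ_PREFIXES` and `describe_refseq_accession` are hypothetical names, and the mapping covers only the prefixes listed in this section.

```python
# Prefix -> (type, description), copied from the table in this section.
REFSEQ_PREFIXES = {
    "NC_": ("Known RefSeq", "Complete genomic molecule (Reference assembly)"),
    "NG_": ("Known RefSeq", "Incomplete genomic region"),
    "NM_": ("Known RefSeq", "mRNA"),
    "NR_": ("Known RefSeq", "Non-coding RNA"),
    "NP_": ("Known RefSeq", "Protein"),
    "NT_": ("Known RefSeq", "Genomic contig or scaffold"),
    "NW_": ("Known RefSeq", "Genomic contig or scaffold"),
    "XM_": ("Model RefSeq", "mRNA (Predicted model)"),
    "XR_": ("Model RefSeq", "Non-coding RNA (Predicted model)"),
    "XP_": ("Model RefSeq", "Protein (Predicted model)"),
}


def describe_refseq_accession(accession):
    """Return (type, description) for a listed RefSeq prefix, else None."""
    # A RefSeq prefix is two letters plus an underscore: three characters.
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix)
```

Because the underscore is part of the key, an INSDC-style accession such as the placeholder `AF123456` returns `None`, which gives a quick way to tell the two naming schemes apart.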


The complete list of accession number prefixes is available on the "Accession Number prefixes: Where are the sequences from?" webpage.