BSON is a binary format in which zero or more ordered key/value pairs are stored as a single entity. We call this entity a document.
The following grammar specifies version 1.1 of the BSON standard. We've written the grammar using a pseudo-BNF syntax. Valid BSON data is represented by the document non-terminal.
The following basic types are used as terminals in the rest of the grammar. Each type must be serialized in little-endian format.
byte | 1 byte (8-bits) |
signed_byte(n) | 8-bit, two's complement signed integer for which the value is n |
unsigned_byte(n) | 8-bit unsigned integer for which the value is n |
int32 | 4 bytes (32-bit signed integer, two's complement) |
int64 | 8 bytes (64-bit signed integer, two's complement) |
uint64 | 8 bytes (64-bit unsigned integer) |
double | 8 bytes (64-bit IEEE 754-2008 binary floating point) |
decimal128 | 16 bytes (128-bit IEEE 754-2008 decimal floating point) |
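As an illustration of the little-endian requirement, the following sketch packs a few of the basic types with Python's standard `struct` module. The variable names are illustrative and not part of the specification. decimal128 is omitted because `struct` has no format code for it.

```python
import struct

# "<" selects little-endian byte order with standard sizes and no padding.
int32_bytes = struct.pack("<i", 1)             # int32: 4 bytes, signed
int64_bytes = struct.pack("<q", -2)            # int64: 8 bytes, two's complement
uint64_bytes = struct.pack("<Q", 2**64 - 1)    # uint64: 8 bytes, unsigned
double_bytes = struct.pack("<d", 1.0)          # double: 8 bytes, IEEE 754 binary64

print(int32_bytes.hex())    # 01000000
print(int64_bytes.hex())    # feffffffffffffff
print(uint64_bytes.hex())   # ffffffffffffffff
print(double_bytes.hex())   # 000000000000f03f
```

The least significant byte comes first in every case, which is why the int32 value 1 is written as `01 00 00 00`.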
The following specifies the rest of the BSON grammar. Note that we use the * operator as shorthand for repetition (e.g. (byte*2) is byte byte). When used as a unary operator, * means that the repetition can occur 0 or more times.
document | ::= | int32 e_list unsigned_byte(0) | BSON Document. int32 is the total number of bytes comprising the document. |
e_list | ::= | element e_list | |
| | "" | ||
element | ::= | signed_byte(1) e_name double | 64-bit binary floating point |
| | signed_byte(2) e_name string | UTF-8 string | |
| | signed_byte(3) e_name document | Embedded document | |
| | signed_byte(4) e_name document | Array | |
| | signed_byte(5) e_name binary | Binary data | |
| | signed_byte(6) e_name | Undefined (value) — Deprecated | |
| | signed_byte(7) e_name (byte*12) | ObjectId | |
| | signed_byte(8) e_name unsigned_byte(0) | Boolean - false | |
| | signed_byte(8) e_name unsigned_byte(1) | Boolean - true | |
| | signed_byte(9) e_name int64 | UTC datetime | |
| | signed_byte(10) e_name | Null value | |
| | signed_byte(11) e_name cstring cstring | Regular expression - The first cstring is the regex pattern, the second is the regex options string. Options are identified by characters, which must be stored in alphabetical order. Valid option characters are i for case-insensitive matching, m for multiline matching, s for dotall mode ("." matches everything), x for verbose mode, and u to make "\w", "\W", etc. match Unicode. | |
| | signed_byte(12) e_name string (byte*12) | DBPointer — Deprecated | |
| | signed_byte(13) e_name string | JavaScript code | |
| | signed_byte(14) e_name string | Symbol — Deprecated | |
| | signed_byte(15) e_name code_w_s | JavaScript code with scope — Deprecated | |
| | signed_byte(16) e_name int32 | 32-bit integer | |
| | signed_byte(17) e_name uint64 | Timestamp | |
| | signed_byte(18) e_name int64 | 64-bit integer | |
| | signed_byte(19) e_name decimal128 | 128-bit decimal floating point | |
| | signed_byte(-1) e_name | Min key | |
| | signed_byte(127) e_name | Max key | |
e_name | ::= | cstring | Key name |
string | ::= | int32 (byte*) unsigned_byte(0) | String - The int32 is the number of bytes in the (byte*) plus one for the trailing null byte. The (byte*) is zero or more UTF-8 encoded characters. |
cstring | ::= | (byte*) unsigned_byte(0) | Zero or more modified UTF-8 encoded characters followed by the null byte. The (byte*) MUST NOT contain unsigned_byte(0), hence it is not full UTF-8. |
binary | ::= | int32 subtype (byte*) | Binary - The int32 is the number of bytes in the (byte*). |
subtype | ::= | unsigned_byte(0) | Generic binary subtype |
| | unsigned_byte(1) | Function | |
| | unsigned_byte(2) | Binary (Old) | |
| | unsigned_byte(3) | UUID (Old) | |
| | unsigned_byte(4) | UUID | |
| | unsigned_byte(5) | MD5 | |
| | unsigned_byte(6) | Encrypted BSON value | |
| | unsigned_byte(7) | Compressed BSON column | |
| | unsigned_byte(8) | Sensitive | |
| | unsigned_byte(128)—unsigned_byte(255) | User defined | |
code_w_s | ::= | int32 string document | Code with scope — Deprecated |
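To make the grammar concrete, here is a minimal encoder sketch in Python covering only a handful of element types (double, string, embedded document, boolean, null, int32, int64). The function names are hypothetical and exist only for this illustration. Choosing int32 for integers that fit in 32 bits and int64 otherwise is an assumption of the sketch, not something the grammar requires.

```python
import struct

def encode_cstring(text: str) -> bytes:
    """cstring ::= (byte*) unsigned_byte(0); the bytes must not contain a null."""
    raw = text.encode("utf-8")
    if b"\x00" in raw:
        raise ValueError("cstring must not contain a null byte")
    return raw + b"\x00"

def encode_string(text: str) -> bytes:
    """string ::= int32 (byte*) unsigned_byte(0); the int32 counts the trailing null."""
    raw = text.encode("utf-8")
    return struct.pack("<i", len(raw) + 1) + raw + b"\x00"

def encode_element(name: str, value) -> bytes:
    """element ::= type byte, e_name, then the type-specific payload."""
    key = encode_cstring(name)
    if value is None:
        return b"\x0a" + key                                   # Null value
    if isinstance(value, bool):                                # check before int: bool subclasses int
        return b"\x08" + key + (b"\x01" if value else b"\x00")
    if isinstance(value, int):
        if -2**31 <= value < 2**31:
            return b"\x10" + key + struct.pack("<i", value)    # 32-bit integer
        return b"\x12" + key + struct.pack("<q", value)        # 64-bit integer
    if isinstance(value, float):
        return b"\x01" + key + struct.pack("<d", value)        # 64-bit floating point
    if isinstance(value, str):
        return b"\x02" + key + encode_string(value)            # UTF-8 string
    if isinstance(value, dict):
        return b"\x03" + key + encode_document(value)          # Embedded document
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")

def encode_document(doc: dict) -> bytes:
    """document ::= int32 e_list unsigned_byte(0); the int32 is the total size."""
    e_list = b"".join(encode_element(k, v) for k, v in doc.items())
    total = 4 + len(e_list) + 1    # size prefix + elements + terminating null
    return struct.pack("<i", total) + e_list + b"\x00"

encoded = encode_document({"hello": "world"})
print(encoded.hex())  # 160000000268656c6c6f0006000000776f726c640000
```

The leading `16 00 00 00` is the document length (22 bytes) and includes the four length bytes themselves and the final null. The inner `06 00 00 00` is the string length, which counts the five bytes of "world" plus its trailing null.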
unsigned_byte(2)
Binary (Old) - This used to be the default subtype, but was deprecated in favor of subtype 0. Drivers and tools should be sure to handle subtype 2 appropriately. The structure of the binary data (the byte* array in the binary non-terminal) must be an int32 followed by a (byte*). The int32 is the number of bytes in the repetition.
unsigned_byte(3)
UUID (Old) - This used to be the UUID subtype, but was deprecated in favor of subtype 4. Drivers and tools for languages with a native UUID type should handle subtype 3 appropriately.
unsigned_byte(128) through unsigned_byte(255)
User defined subtypes. The binary data can be anything.
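The difference between subtype 0 and the deprecated subtype 2 is easiest to see in the bytes. The sketch below encodes the binary non-terminal for both. The helper name `encode_binary` is hypothetical, and the sketch assumes that only subtype 2 needs the extra inner length prefix described above.

```python
import struct

def encode_binary(payload: bytes, subtype: int = 0) -> bytes:
    """binary ::= int32 subtype (byte*); the int32 counts only the (byte*)."""
    if not 0 <= subtype <= 255:
        raise ValueError("subtype must fit in one unsigned byte")
    if subtype == 2:
        # Binary (Old): the data is itself an int32 length followed by the bytes.
        payload = struct.pack("<i", len(payload)) + payload
    return struct.pack("<i", len(payload)) + bytes([subtype]) + payload

generic = encode_binary(b"\x01\x02\x03")            # subtype 0
old = encode_binary(b"\x01\x02\x03", subtype=2)     # subtype 2

print(generic.hex())  # 0300000000010203
print(old.hex())      # 070000000203000000010203
```

For subtype 2 the outer length is 7 because it covers the 4-byte inner length plus the 3 data bytes. A reader handling subtype 2 therefore has to skip that inner int32 to reach the actual data.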