Skip to content

Commit 15d7731

Browse files
hlcianfagnaamotl
andauthored
Autogenerated sequences for PRIMARY KEY values (#185)
* First version of page about generating IDs automatically * Fix UDF link * Fix links formatting and wording * Add link to UUIDv7 UDF * Update docs/performance/inserts/sequences.rst Co-authored-by: Andreas Motl <[email protected]> * Update docs/performance/inserts/sequences.rst Co-authored-by: Andreas Motl <[email protected]> * Update docs/performance/inserts/sequences.rst Co-authored-by: Andreas Motl <[email protected]> * Update docs/performance/inserts/sequences.rst Co-authored-by: Andreas Motl <[email protected]> * Use field lists instead of bolded headings for pros and cons as suggested by @amotl * Reword paragraph about IDs from external systems along the lines of Andrea's comment * Define dependencies inline following PEP 723 as suggested by Andreas * Add comment about range queries as suggested by seut * Reworded the introduction as suggested by @kneth * Revert "Define dependencies inline following PEP 723 as suggested by Andreas" This reverts commit e2e1612. * Make field lists end with a blank line * Reapply "Define dependencies inline following PEP 723 as suggested by Andreas" This reverts commit 3e6e742. * Try inline dependencies with blank line after code directive * Fix identation of field lists * Replace tabs with spaces for identation * Use rst syntax for link instead of md * Replace infinite loop with safer approach with a max number of retries and exponential backoff --------- Co-authored-by: Andreas Motl <[email protected]>
1 parent ca58045 commit 15d7731

File tree

2 files changed

+206
-0
lines changed

2 files changed

+206
-0
lines changed

docs/performance/inserts/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -30,5 +30,6 @@ This section of the guide will show you how.
3030
parallel
3131
tuning
3232
testing
33+
sequences
3334

3435
.. _Abstract Syntax Tree: https://en.wikipedia.org/wiki/Abstract_syntax_tree
Lines changed: 205 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,205 @@
1+
.. _autogenerated_sequences_performance:
2+
3+
###########################################################
4+
Autogenerated sequences and PRIMARY KEY values in CrateDB
5+
###########################################################
6+
7+
As you begin working with CrateDB, you might be puzzled why CrateDB does not
8+
have a built-in, auto-incrementing "serial" data type as PostgreSQL or MySQL.
9+
10+
As a distributed database, designed to scale horizontally, CrateDB needs as many
11+
operations as possible to complete independently on each node without any
12+
coordination between nodes.
13+
14+
Maintaining a global auto-increment value requires that a node checks with other
15+
nodes before allocating a new value. This bottleneck would be hindering our
16+
ability to achieve `extremely fast ingestion speeds`_.
17+
18+
That said, there are many alternatives available and we can also implement true
19+
consistent/synchronized sequences if we want to.
20+
21+
************************************
22+
Using a timestamp as a primary key
23+
************************************
24+
25+
This option involves declaring a column as follows:
26+
27+
.. code:: psql
28+
29+
BIGINT DEFAULT now() PRIMARY KEY
30+
31+
:Pros:
32+
Always increasing number - ideal if we need to timestamp records creation
33+
anyway
34+
35+
:Cons:
36+
gaps between the numbers, not suitable if we may have more than one record on
37+
the same millisecond
38+
39+
*************
40+
Using UUIDs
41+
*************
42+
43+
This option involves declaring a column as follows:
44+
45+
.. code:: psql
46+
47+
TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY
48+
49+
:Pros:
50+
Globally unique, no risk of conflicts if merging things from different
51+
tables/environments
52+
53+
:Cons:
54+
No order guarantee. Not as human-friendly as numbers. String format may not
55+
be applicable to cover all scenarios. Range queries are not possible.
56+
57+
************************
58+
Use UUIDv7 identifiers
59+
************************
60+
61+
`Version 7 UUIDs`_ are a relatively new kind of UUIDs which feature a
62+
time-ordered value. We can use these in CrateDB with an UDF_ with the code from
63+
`UUIDv7 in N languages`_.
64+
65+
:Pros:
66+
Same as `gen_random_text_uuid` above but almost sequential, which enables
67+
range queries.
68+
69+
:Cons:
70+
not as human-friendly as numbers and slight performance impact from UDF use
71+
72+
*********************************
73+
Use IDs from an external system
74+
*********************************
75+
76+
In cases where data is imported into CrateDB from external systems that employ
77+
identifier governance, CrateDB does not need to generate any identifier values
78+
and primary key values can be inserted as-is from the source system.
79+
80+
See `Replicating data from other databases to CrateDB with Debezium and Kafka`_
81+
for an example.
82+
83+
*********************
84+
Implement sequences
85+
*********************
86+
87+
This approach involves a table to keep the latest values that have been consumed
88+
and client side code to keep it up-to-date in a way that guarantees unique
89+
values even when many ingestion processes run in parallel.
90+
91+
:Pros:
92+
Can have any arbitrary type of sequences, (we may for instance want to
93+
increment values by 10 instead of 1 - prefix values with a year number -
94+
combine numbers and letters - etc)
95+
96+
:Cons:
97+
Need logic for the optimistic update implemented client-side, the sequences
98+
table becomes a bottleneck so not suitable for high-velocity ingestion
99+
scenarios
100+
101+
We will first create a table to keep the latest values for our sequences:
102+
103+
.. code:: psql
104+
105+
CREATE TABLE sequences (
106+
name TEXT PRIMARY KEY,
107+
last_value BIGINT
108+
) CLUSTERED INTO 1 SHARDS;
109+
110+
We will then initialize it with one new sequence at 0:
111+
112+
.. code:: psql
113+
114+
INSERT INTO sequences (name,last_value)
115+
VALUES ('mysequence',0);
116+
117+
And we are going to do an example with a new table defined as follows:
118+
119+
.. code:: psql
120+
121+
CREATE TABLE mytable (
122+
id BIGINT PRIMARY KEY,
123+
field1 TEXT
124+
);
125+
126+
The Python code below reads the last value used from the sequences table, and
127+
then attempts an `optimistic UPDATE`_ with a ``RETURNING`` clause, if a
128+
contending process already consumed the identity nothing will be returned so our
129+
process will retry until a value is returned, then it uses that value as the new
130+
ID for the record we are inserting into the ``mytable`` table.
131+
132+
.. code:: python
133+
134+
# /// script
135+
# requires-python = ">=3.8"
136+
# dependencies = [
137+
# "records",
138+
# "sqlalchemy-cratedb",
139+
# ]
140+
# ///
141+
142+
import time
143+
144+
import records
145+
146+
db = records.Database("crate://")
147+
sequence_name = "mysequence"
148+
149+
max_retries = 5
150+
base_delay = 0.1 # 100 milliseconds
151+
152+
for attempt in range(max_retries):
153+
select_query = """
154+
SELECT last_value,
155+
_seq_no,
156+
_primary_term
157+
FROM sequences
158+
WHERE name = :sequence_name;
159+
"""
160+
row = db.query(select_query, sequence_name=sequence_name).first()
161+
new_value = row.last_value + 1
162+
163+
update_query = """
164+
UPDATE sequences
165+
SET last_value = :new_value
166+
WHERE name = :sequence_name
167+
AND _seq_no = :seq_no
168+
AND _primary_term = :primary_term
169+
RETURNING last_value;
170+
"""
171+
if (
172+
str(
173+
db.query(
174+
update_query,
175+
new_value=new_value,
176+
sequence_name=sequence_name,
177+
seq_no=row._seq_no,
178+
primary_term=row._primary_term,
179+
).all()
180+
)
181+
!= "[]"
182+
):
183+
break
184+
185+
delay = base_delay * (2**attempt)
186+
print(f"Attempt {attempt + 1} failed. Retrying in {delay:.1f} seconds...")
187+
time.sleep(delay)
188+
else:
189+
raise Exception(f"Failed after {max_retries} retries with exponential backoff")
190+
191+
insert_query = "INSERT INTO mytable (id, field1) VALUES (:id, :field1)"
192+
db.query(insert_query, id=new_value, field1="abc")
193+
db.close()
194+
195+
.. _extremely fast ingestion speeds: https://cratedb.com/blog/how-we-scaled-ingestion-to-one-million-rows-per-second
196+
197+
.. _optimistic update: https://cratedb.com/docs/crate/reference/en/latest/general/occ.html#optimistic-update
198+
199+
.. _replicating data from other databases to cratedb with debezium and kafka: https://cratedb.com/blog/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka
200+
201+
.. _udf: https://cratedb.com/docs/crate/reference/en/latest/general/user-defined-functions.html
202+
203+
.. _uuidv7 in n languages: https://github.com/nalgeon/uuidv7/blob/main/src/uuidv7.cratedb
204+
205+
.. _version 7 uuids: https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-version-7

0 commit comments

Comments
 (0)