Incrementality
This is a walkthrough of the most important parts of the implementation of incrementality, aimed at people working on Glean.
Unitsβ
A unit is an arbitrary string. In Thrift:
typedef string (hs.type = "ByteString") UnitName
Units do not need to be declared beforehand; a Unit exists if it is the owner of at least one fact.
Each unit is assigned a unique integer, called UnitId
. This is done
by the storage backend the first time the unit is encountered, and
thereafter the unit will be referred to by its UnitId
.
Capturing ownership data from the indexerβ
The indexer provides an initial mapping of facts to units (or units to facts, if you like). A fact can have an arbitrary number of owners. This is done in one of two ways.
JSONβ
For indexers that produce JSON, the unit is an optional field:
[
{
"predicate": "glean.test.Node.5",
"facts" : [
{
"key": { "label": "d" }
}
],
"unit": "D"
},
The unit
here is the owner of all the facts in the facts
array.
For an example JSON file with units, see
.Binaryβ
Indexers that produce binary facts can annotate a batch of facts with owners:
struct Batch {
...
5: map<UnitName, list<Id> (hs.type = "VectorStorable")> (
hs.type = "HashMap",
) owned;
Specifies ownership of the facts in this batch. The list is really a
list of intervals [x1,x2, y1,y2, ... ]
representing the inclusive
ranges x1..x2
, y1..y2
, ... where x1 <= x2
, y1 <= y2
, ...
The ranges do not need to be sorted, and can overlap.
Storing the ownership dataβ
The raw ownership data from the indexer is captured when the facts are written to the DB. The fact IDs in the ownership data are substituted with the final fact IDs of the facts.
Note that we might be writing a fact that already exists in the DB; in that case we don't write the fact, but we must still remember the ownership mapping, because it might be adding additional owners.
Ownership setsβ
In general the owner of a fact is an ownership set, which has the following form:
set ::= unit
| set_1 || ... || set_n
| set_1 && ... && set_n
That is, an ownership set is either a unit, or a disjunction of sets
(set_1 || ... || set_n
), or a conjunction of sets (set_1 && ... &&
set_n
).
Each unique set is assigned a unique ID, called UsetId
in the
implementation. UsetId
and UnitId
use distinct ranges, so we can
represent a set in a canonical way as an operator (||
or &&
)
together with an integer set. The implementation of this is
SetExpr<T>
in
A set may only refer to UsetId
s that are smaller than its UsetId
(just like facts can only refer to smaller fact IDs).
The efficiency of this scheme depends on the assumption that while there are a lot of facts, there are relatively few distinct ownership sets. Typically we see between 10-100x more facts than sets. The storage we need for sets is therefore small relative to the size of the DB. This assumption rests on the indexer using a coarse enough range of units; one unit per file or module is good, but one unit per function would lead to more sets.
Visibility and slicesβ
When building an incremental DB, we will exclude some units from the base DB. This is done by building a slice (see
)- A slice is a set of
UsetId
- A
UsetId
is in the slice ifunit
:unit
is not excludedset_1 || ... || set_n
: any ofset_1 ... set_n
are in the sliceset_1 && ... && set_n
: all ofset_1 ... set_n
are in the slice
- A fact is visible if its
UsetId
is in the slice
A slice is represented by a bitmap. To construct the slice we
determine the visibility of each UsetId
in ascending order, so that
when we process a given set we already know the visibility of the sets
it refers to.
Propagating ownership to dependent factsβ
The property we want is that
In a DB consisting of all the visible facts for a given set of excluded units, every fact referenced by a visible fact is also visible.
(Basically, there are no dangling references.)
To achieve this, we have to make sure that the ownership sets are built such that this property is true for any set of excluded units.
This is done as follows:
- for a fact F with ownership set A
- for each fact referenced by F with ownership set B
- we make the owner
A || B
- we make the owner
- for each fact referenced by F with ownership set B
This is obviously correct: if F is visible, then every fact it refers to will also be visible.
In practice we make this efficient by:
Traverse facts in reverse order, so we always visit a fact before the facts it depends on, and we only have to visit each fact once.
We only have disjunctions at this stage (conjunctions arise with derived facts in the next stage), so we can normalise
).A || B
by taking the union of the setsA
andB
. We use an efficient integer set representation with fast union (see
This propagation requires O(facts) time, and the current implementation also requires O(facts) space too.
The implementation of this propagation is computeOwnership
in
Ownership propagation happens after all the facts are written, but
before derivation starts. This is why we have the glean complete
command: it marks the predicates complete and computes ownership sets
for the written facts.
In the future we could consider
- Starting the propagation in parallel with fact writing. We may have to visit facts multiple times, however.
- Do it with less than O(facts) space, to support large DBs. We'll probably have to write intermediate results to the DB.
Storing ownership in the DBβ
Ownership that we store in the DB consists of:
ownershipUnits
: a mappings from unit toUnitId
ownershipUnitIds
: a mapping fromUnitId
to unit. This is used only for introspection (displaying an ownership set, e.g. for the:owner
command in the shell).ownershipSets
: a mapping fromUsetId
to set, encoded using Elias Fano Coding.factOwners
: a mapping fromId
toUsetId
, giving the owner for each fact ID. This is an interval map, so for a series of consecutive fact Ids with the same owner, we only store the first.
Caching fact ownershipβ
Reading from factOwners
each time we want the ownership of a fact
can be very expensive, because we need to check the owner
every time we visit a fact during a query to determine its visibility.
We therefore cache the ownership information in memory. This is done lazily: we read a page of fact owners at a time. To find the owner of a fact, we binary-search in the page to find the beginning of the interval containing the desired fact ID.
Ownership for derived factsβ
A derived fact should be visible if and only if all the facts that it was derived from are visible.
This naturally gives rise to the following implementation:
- The owner for a fact derived from facts
F1...Fn
with ownersO1...On
respectively isO1 && ... && On
This is straightforward except that we might have multiple threads deriving in parallel, and e want to avoid having duplicate ownership sets created.
So the implementation is similar to the way facts are written:
When deriving we create a batch of ownership information (
DefineOwnership
) consisting of the new ownership sets and the mapping from facts to ownership sets. EachDefineOwnership
is written to the DB as it is created, usingaddDefineOwnership
.When writing the facts to the DB, we de-duplicate the ownership sets against those in the DB, and renumber the sets as necessary. This is
computeDerivedOwnership
.
Note that because derivation needs to know the owners of facts derived from, the ownership propagation described above needs to be complete first.
Ownership for stacked databasesβ
The main complication with stacked DBs is that a DB may induce new ownership for facts in a lower DB in the stack, as described above.
For the stacked DB we do ownership propagation as before, including propagating ownership to facts in the base DBs. But we don't want to have to consider all facts in the stack, so instead of walking through all facts in the stack we switch to using a priority queue for facts in the base DB(s) to track facts to which we are propagating ownership.
Now that a fact in the base DB can have multiple owners, we have an implementation choice:
getOwner(fact)
returns all owners for the fact.- A fact is visible if any of its owners is in the slice.
- When deriving facts later, we have to construct the set of the disjunction of all owners for a fact we derived from.
getOwner(fact)
returns the first owner in the stack. For this to work, the first owner has to include the base DB owner too. For example, if fact F has base owner A and new owner B propagated from the stacked DB, then its owner in the stacked DB should beA || B
. Now when deriving, we only have one owner for each fact and we don't need to construct new sets.
In Glean we currently do (2) because it seemed simpler.
Derived facts in stacked incremental DBsβ
There is a problem here: suppose we have a fact F with owner A in the
base DB, and we propagate a new owner B to F in a stacked DB. F's
owner is now A || B
. However, facts that derived from F in the
original base DB will have ownership sets that don't include B. The
consequence is that later we might have a situation where F is
visible, but facts derived from it are not.
To fix this, we want to do incremental derivation on the stacked DB, and propagate this new ownership to the base DB, which will fix up the ownership of the derived facts. Incremental derivation must therefore consider facts that have new ownership in the stacked DB when deriving. At the time of writing, this isn't implemented yet.