Writing data to Glean
This page describes the various ways in which data gets into Glean.
There are two main methods for creating a DB. Repo-wide indexing jobs which require multiple workers and have dependent tasks are managed by the server, while simple one-off DB creation can be performed independently by a single client.
Client-driven writing
A database can be created by a client using any of these methods:
- Programmatically, using one of the APIs listed in APIs for Writing.
- On the command line: invoke the glean command-line tool to send data in JSON format; see Creating a database using the command line.
- In the shell, use glean shell --db-root=<dir> and then use the :load command to create a DB from a JSON file. See Loading a DB from JSON in the shell.
Server-driven writing
Large indexing jobs are coordinated by the server, using a recipe to
define the various tasks and the dependencies between them. Recipes
are defined in the recipes configuration; see the --recipe-config
option in Common options.
The job proceeds as follows:
- An indexing job is started by calling the server's kickOff Thrift method. This creates a work queue of tasks on the server.
- Clients obtain tasks from the server by calling getWork. Tasks may have dependencies between them, so the server won't hand out a task until its dependencies are complete.
- When all tasks are done, the server marks the database as complete.
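The hand-out rule described above can be illustrated with a small toy model. This is only a sketch of the idea, not Glean's server code: the class ToyWorkQueue, its method names, and the task names in the recipe are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyWorkQueue:
    """Toy model of a recipe-driven work queue (not Glean's real server)."""

    # Maps each task name to the set of task names it depends on.
    recipe: dict
    handed_out: set = field(default_factory=set)
    finished: set = field(default_factory=set)

    def get_work(self) -> Optional[str]:
        """Return a task whose dependencies are all finished, or None."""
        for task in sorted(self.recipe):
            if task in self.handed_out:
                continue
            if self.recipe[task] <= self.finished:
                self.handed_out.add(task)
                return task
        return None

    def work_finished(self, task: str) -> None:
        """Record that a previously handed-out task has completed."""
        if task not in self.handed_out:
            raise ValueError(f"task {task!r} was never handed out")
        self.finished.add(task)

    def database_complete(self) -> bool:
        """The database is complete once every task in the recipe is done."""
        return self.finished == set(self.recipe)


# Hypothetical recipe: "derive" must wait for both indexing tasks.
queue = ToyWorkQueue(
    {"index-a": set(), "index-b": set(), "derive": {"index-a", "index-b"}}
)
first = queue.get_work()
second = queue.get_work()
blocked = queue.get_work()  # "derive" is not ready yet, so nothing is handed out
queue.work_finished(first)
queue.work_finished(second)
last = queue.get_work()  # both dependencies are done, so "derive" is released
queue.work_finished(last)
```

The point of the model is that a client asking for work may get nothing back even though tasks remain, simply because the remaining tasks are still waiting on their dependencies.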
APIs for writing
- The Haskell API for writing
If none of the above works for you, the Thrift API enables basic write access to the database.
- kickOff can be used to create a new DB
- sendJsonBatch is for sending facts in JSON-serialized form
- finishBatch exposes the result of a previously sent JSON batch
- workFinished closes a DB
A rough outline of a client looks like:
glean = make_glean_thrift_client()
db_handle = make_uuid()
glean.kickOff(my_repo, KickOffFill(writeHandle=db_handle))
for json_batch in json_batches:
    handle = glean.sendJsonBatch(json_batch)
    result = glean.finishBatch(handle)
    # handle result
glean.workFinished(my_repo, db_handle, success_or_failure)
Writing from the command line
JSON format
The JSON format for Glean data is described in Thrift and JSON.
Here's an example of JSON data for writing to Glean:
[
  { "predicate": "cxx1.Name.1",            # define facts for cxx1.Name.1
    "facts": [
      { "id": 1, "key": "abc" },           # define a fact with id 1
      { "id": 2, "key": "def" }
    ]
  },
  { "predicate": "cxx1.FunctionName.1",    # define facts for cxx1.FunctionName.1
    "facts": [
      { "id": 3,
        "key": {
          "name": { "id": 1 }}}            # reference to fact with id 1
    ]
  },
  { "predicate": "cxx1.FunctionQName.1",   # define facts for cxx1.FunctionQName.1
    "facts": [
      { "key": {
          "name": 3,                       # 3 is shorthand for { "id": 3 }
          "scope": { "global_": {} } } },
      { "key": {
          "name": {
            "key": {                       # define a nested fact directly
              "name": {
                "key": "ghi" }}},          # another nested fact
          "scope": {
            "namespace_": {
              "key": {
                "name": {
                  "key": "std" }}}}}}
    ]
  }
]
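The # annotations in the example are explanatory, and strict JSON parsers such as Python's json module reject them. One way to avoid hand-editing such files is to build the same structure programmatically and serialize it. The following is a minimal sketch covering the first three facts of the example; the variable names are illustrative only.

```python
import json

# Facts for cxx1.Name.1, each with a file-local id so later facts can refer to them.
names = {
    "predicate": "cxx1.Name.1",
    "facts": [
        {"id": 1, "key": "abc"},
        {"id": 2, "key": "def"},
    ],
}

# A cxx1.FunctionName.1 fact whose "name" field refers to the fact with id 1.
function_names = {
    "predicate": "cxx1.FunctionName.1",
    "facts": [{"id": 3, "key": {"name": {"id": 1}}}],
}

# A cxx1.FunctionQName.1 fact using the bare-number shorthand for { "id": 3 }.
function_qnames = {
    "predicate": "cxx1.FunctionQName.1",
    "facts": [{"key": {"name": 3, "scope": {"global_": {}}}}],
}

batch = [names, function_names, function_qnames]

# Serialize to comment-free JSON text, ready to be written to a file.
text = json.dumps(batch, indent=2)
```

The resulting text can be written to a file and used as the JSON input in the sections that follow.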
The rules of the game are:
- Predicate names must include versions, i.e. cxx1.Name.1 rather than cxx1.Name.
- The id field when defining a fact is optional. The id numbers in the input file will not be the final id numbers assigned to the facts in the database.
- There are no restrictions on id values (any 64-bit integer will do) but an id value may not be reused within a file.
- Later facts may refer to earlier ones using either { "id": N } or just N.
- It is only possible to refer to ids from facts in the same file, if you are writing multiple files using glean write or via the sendJsonBatch API.
- A nested fact can be defined inline, instead of defining it with an id first and then referencing it.
- An inline nested fact can be given an id and referred to later.
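Some of these rules can be checked mechanically before sending a file. The sketch below is a heuristic, not part of Glean: the function check_batch is hypothetical, it assumes a versioned predicate name ends in a dot followed by digits, and it treats any object containing only "id" as a reference and any object containing "key" as a fact definition. The bare-number shorthand cannot be distinguished from an ordinary numeric field without the schema, so it is not checked.

```python
import re

# Assumption: a versioned predicate name ends in ".<digits>", e.g. cxx1.Name.1.
VERSIONED = re.compile(r".+\.\d+")


def check_batch(batch):
    """Return a list of rule violations found in one file's worth of facts."""
    problems = []
    defined = set()  # ids defined so far, in document order

    def define(fact):
        # Check nested content first, then record this fact's own id, if any.
        walk(fact.get("key"))
        if "id" in fact:
            if fact["id"] in defined:
                problems.append(f"id {fact['id']} is reused")
            defined.add(fact["id"])

    def walk(value):
        if isinstance(value, dict):
            if set(value) == {"id"}:
                # Heuristic: an object holding only "id" is a reference.
                if value["id"] not in defined:
                    problems.append(f"reference to unknown id {value['id']}")
            elif "key" in value:
                # Heuristic: an object holding "key" is an inline nested fact.
                define(value)
            else:
                for child in value.values():
                    walk(child)
        elif isinstance(value, list):
            for child in value:
                walk(child)

    for group in batch:
        if not VERSIONED.fullmatch(group["predicate"]):
            problems.append(f"predicate {group['predicate']} has no version")
        for fact in group["facts"]:
            define(fact)
    return problems


# A deliberately broken batch: unversioned predicate, reused id, dangling reference.
bad_batch = [
    {"predicate": "cxx1.Name",
     "facts": [{"id": 1, "key": "abc"}, {"id": 1, "key": "def"}]},
    {"predicate": "cxx1.FunctionName.1",
     "facts": [{"key": {"name": {"id": 9}}}]},
]
bad_problems = check_batch(bad_batch)
```

Because ids are only meaningful within one file, a check like this would be run separately on each file or batch.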
Loading a DB from JSON in the shell
The shell is useful for experimenting with creating a DB from JSON data directly. Let's try loading the data above into a DB in the shell:
$ mkdir /tmp/glean
$ glean shell --db-root /tmp/glean
Glean Shell, dev mode
type :help for help.
no fbsource database availabe
> :load test/0 /home/smarlow/test
I0514 01:19:37.137109 3566745 Work.hs:184] test/16: database complete
Let's see what facts we loaded:
test> :stat
1
  count: 72
  size: 5988
cxx1.FunctionName.1
  count: 2
  size: 66
cxx1.FunctionQName.1
  count: 2
  size: 70
cxx1.Name.1
  count: 4
  size: 148
cxx1.NamespaceQName.1
  count: 1
  size: 35
test>
Note that there were 4 cxx1.Name.1 facts - some of those were defined as inline nested facts in the JSON. We can query them all:
test> cxx1.Name _
4 results, 1 queries, 4 facts, 0.22ms, 44296 bytes
{ "id": 1096, "key": "abc" }
{ "id": 1097, "key": "def" }
{ "id": 1100, "key": "ghi" }
{ "id": 1102, "key": "std" }
Note that the id values here do not correspond to the id values in the input file.
Creating a database using the command line
The glean command-line tool can be used to create a database directly on the server.
To create a database from a single file of JSON facts:
glean create --service <write-server> --finish --db <name>/<hash> <filename>
where
- <write-server> is the host:port of the Glean server.
- <name> is the name for your DB. For indexing repositories we normally use the name of the repository, but it's just a string, so you can use whatever you want.
- <hash> identifies this particular instance of your database. For repositories we normally use the revision hash, but, again, it's just a string.
- <filename> is the file containing the JSON facts.
If the file is more than, say, 100MB, this operation will probably time out sending the data to the server. To send large amounts of data you need to batch it up into multiple files, and then send it like this:
glean create --service <write-server> --db <name>/<hash>
glean write --service <write-server> --db <name>/<hash> <filename1>
glean write --service <write-server> --db <name>/<hash> <filename2>
...
glean finish --service <write-server> --db <name>/<hash>
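When there are many files, the command sequence above can be generated rather than typed by hand. The following is a minimal sketch that only builds the argument lists; the helper glean_batch_commands is hypothetical, as are the server address, DB name and file names used to exercise it.

```python
import shlex


def glean_batch_commands(service, db, filenames):
    """Return the argv lists for creating, filling and finishing a DB."""
    target = ["--service", service, "--db", db]
    commands = [["glean", "create", *target]]
    # One "glean write" per file, in the order given.
    commands += [["glean", "write", *target, name] for name in filenames]
    commands.append(["glean", "finish", *target])
    return commands


# Hypothetical server address, DB name and file names, for illustration only.
commands = glean_batch_commands(
    "localhost:1234", "myrepo/abc123", ["facts-0.json", "facts-1.json"]
)
for argv in commands:
    print(shlex.join(argv))
```

Each argument list could then be passed to subprocess.run(argv, check=True) so that the sequence stops at the first failing command. Remember that ids cannot be referenced across files, so each file has to be self-contained in that respect.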
To find out if your DB made it:
glean shell --service <write-server> :list
This will list the DBs available on the write server.