Skip to content

Instantly share code, notes, and snippets.

@bunkat
Last active August 29, 2015 14:01
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save bunkat/9494429e3cbbbeda2b11 to your computer and use it in GitHub Desktop.
Save bunkat/9494429e3cbbbeda2b11 to your computer and use it in GitHub Desktop.
Getting lots of optional nodes in a single Cypher query
= Graph Initialization
I'm using Neo4j 2.0.1 with Cypher over the Batch REST API.
//hide
//setup
//output
[source,cypher]
----
CREATE (u:user {id: "u1", name: "Bill"})-[:CONTACT]->(c:contact {id: "c1"})
CREATE (u)-[:ADDRESS]->(:address {id: "a1", city: "New York"})
CREATE (u)-[:ADDRESS]->(:address {id: "a2", city: "San Francisco"})
CREATE (u)-[:PHONE]->(:phone {id: "p1", number: "555-1234"})
CREATE (c)-[:PHONE]->(:phone {id: "p2", number: "555-3456"})
CREATE (c)-[:PHONE]->(:phone {id: "p3", number: "555-7890"})
CREATE (c)-[:CREATED]->(:activity {id: "h1", createdat: 1000})<-[:USERACTIVITY]-(:contact {id: "c2"})
CREATE (c)-[:HISTORY]->(:activity {id: "h2", createdat: 2000})<-[:USERACTIVITY]-(:contact {id: "c3"})
----
= Simplified Data Model
Here is what the simplified data model looks like. I have a `user` node that is connected to a `contact` node. Both the `user` node and `contact` node can have 0 or more phone numbers and addresses associated with them. Each `contact` node also points to `activity` nodes which capture when the node was edited and when it was created. I'm trying to pull all of this data back in a single query to reduce round trips to the database.
// graph
= Current Query
Here is the current query that I'm using, but I'm sure it can be improved quite a bit. In my production queries, I use parameters instead of populating the `id` fields directly.
//output
[source,cypher]
----
MATCH (u:user { id: "u1" })
WITH u
MATCH u-[:CONTACT]->(c:contact)
WITH u, c
OPTIONAL MATCH (c)-[:CREATED]->(xca:activity)<-[:USERACTIVITY]-(xcc:contact)
OPTIONAL MATCH (c)-[:HISTORY]->(xcu:activity)<-[:USERACTIVITY]-(xuc:contact)
OPTIONAL MATCH (c)-[:PHONE]->(xp:phone)
OPTIONAL MATCH (c)-[:ADDRESS]->(xa:address)
OPTIONAL MATCH (u)-[:PHONE]->(xup:phone)
OPTIONAL MATCH (u)-[:ADDRESS]->(xua:address)
WITH DISTINCT c AS x, u,
COLLECT(DISTINCT xp) AS xps,
COLLECT(DISTINCT xa) AS xas,
COLLECT(DISTINCT xup) AS xups,
COLLECT(DISTINCT xua) AS xuas,
xca.createdat AS createdat,
xcu.createdat AS updatedat,
{id: xcc.id} AS createdby,
{id: xuc.id} AS updatedby
RETURN COLLECT({
id: x.id,
name: COALESCE(u.name, x.name),
createdat: createdat,
createdby: createdby,
updatedat: updatedat,
updatedby: updatedby,
phones: (CASE WHEN size(xps)= 0
THEN NULL
ELSE [xp IN xps | { id: xp.id, number: xp.number}]
END),
userphones: (CASE WHEN size(xups)= 0
THEN NULL
ELSE [xup IN xups | { id: xup.id, number: xup.number }]
END),
addresses: (CASE WHEN size(xas)= 0
THEN NULL
ELSE [xa IN xas | { id: xa.id, city: xa.city}]
END),
useraddresses: (CASE WHEN size(xuas)= 0
THEN NULL
ELSE [xua IN xuas | { id: xua.id, city: xua.city}]
END)
}) AS r
----
//table
= Results
The current performance I'm seeing is pretty dismal. If I remove all of the OPTIONAL MATCH statements and just return the `contact` nodes, it's about 4x faster but still much too slow. When I use a similar query to return 1000 contact nodes, it starts to crawl (even after running multiple times to make sure it is cached).
I have indexes set up on both contacts and users by id. I've used the PROFILE command to see if anything looks funny, but I see the indexes being used and no _rows counter or _db_hits_ counts above 3000. My database currently has about 12000 nodes in it, 1100 are in the result set I'm trying to return.
I've had problems with slow disks before, but since this is a read operation there should be no locks or disk syncs causing the performance issue. I've also moved to a new server which is a 24 proc Xeon with 128 GB of memory. The database sits by itself on 6 enterprise SSDs using RAID 10. I'm currently using the default settings for Neo4j but since my graph is so small I'm not sure that should matter.
= Question
There is obviously something wrong with the query pattern that I'm using. How should I be querying for nodes when a lot of the relationships are optional? I have instances throughout my database where nodes can optionally be connected to lots of other nodes (which was the primary reason for switching over to a graph database). Thanks!
//console
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment