Thursday, October 4, 2012

10 Caveats Neo4j users should be familiar with

@YaronNaveh

UPDATE: Michael Hunger from Neo4j responds to some of my items in a comment.

Recently I used the Neo4j graph database in GitMoon. I have gathered some of the tricky things I learned the hard way, and I recommend that every Neo4j user take a look.

1. Execution guard
By default queries can run forever. This means that if you have accidentally (or on purpose) sent the server a long-running query with many nested relationships, your CPU may be busy for a while. The solution is to configure neo to terminate all queries that run longer than some threshold. Here's how you do this:

In neo4j.properties add this:

execution_guard_enabled=true

Then in neo4j-server.properties add this:

org.neo4j.server.webserver.limit.executiontime=20000

where the limit value is in milliseconds, so the above will terminate each query that runs over 20 seconds. Your CPU will thank you for it!


2. IDs (may) be volatile
Each node has a unique ID assigned to it by neo, so in your Cypher you could do something like:

START n=node(1020) RETURN n

START n=node(*) WHERE ID(n)=1020 RETURN n

Both queries return the same node.

Early on I was tempted to use this ID in urls of my app:

/projects/1020/users


This was very convenient since I did not have a numeric natural key for nodes and I did not want the hassle of encoding strings in URLs.

Bad idea. IDs are volatile. In theory, when you restart the db all nodes may be assigned different IDs, and IDs of deleted nodes may be reused for new nodes. In practice, I have not seen this happen, and I believe that with the current neo versions this will never happen. However, you should not take it as guaranteed and should always come up with your own unique way to identify nodes.
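
Here is a minimal sketch of one way to get such an identifier, assuming you are free to add a property to each node. The property name `key` and both helper names are made up for illustration; the idea is only that the app generates the value once, stores it on the node (ideally indexed), and uses it in URLs instead of the internal ID.

```python
import uuid


def new_node_key() -> str:
    """Return a random, URL-safe key to store as a node property.

    The property name `key` is hypothetical; any indexed property works.
    """
    return uuid.uuid4().hex  # 32 lowercase hex characters, no encoding needed


def project_users_url(key: str) -> str:
    """Build the app URL from the app-level key instead of neo's internal ID."""
    return f"/projects/{key}/users"


key = new_node_key()
print(project_users_url(key))
```

Since the key never depends on where neo stores the node, it survives restarts, deletes and re-imports.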

3. ORDER BY lower case
There is no built-in function that lets you return results ordered by the lowercase value of some field. You have to maintain a shadow field with the lowercase values. For example:

RETURN n.name ORDER BY n.name_lower
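
Keeping the shadow field in sync is the app's job. A minimal sketch, assuming node properties pass through a plain dict before they are written (the helper name is made up for illustration):

```python
def with_lowercase_shadow(props: dict, field: str = "name") -> dict:
    """Return a copy of props with a `<field>_lower` shadow property added."""
    out = dict(props)  # do not mutate the caller's dict
    out[f"{field}_lower"] = str(props[field]).lower()
    return out


print(with_lowercase_shadow({"name": "GitMoon"}))
# {'name': 'GitMoon', 'name_lower': 'gitmoon'}
```

Calling this on every create and update keeps name_lower from drifting away from name.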

4. Random rows
There is no built-in mechanism to return a random row.

The easiest way is to use a two-phase randomization: first select the COUNT of available rows, then draw a random number below that count in your app code and SKIP that many rows:

START n=node(*)
WHERE n.type='project'
RETURN count(*)

// result is 1000
// now in your app code you make a draw and the random number is 512

START n=node(*)
WHERE n.type='project'
RETURN n
SKIP 512
LIMIT 1
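
The draw between the two queries could look like the sketch below. It only builds the second query's text; how you send it depends on your driver. The function names are made up for illustration, and the offset is formatted in as an integer produced by our own code, never as user input.

```python
import random
from typing import Optional


def random_offset(row_count: int, rng: Optional[random.Random] = None) -> int:
    """Pick how many rows to SKIP: a uniform integer in [0, row_count - 1]."""
    if row_count <= 0:
        raise ValueError("no rows to choose from")
    return (rng or random).randrange(row_count)


def random_row_query(offset: int) -> str:
    """Build the second-phase query for an offset drawn by random_offset."""
    return (
        "START n=node(*)\n"
        "WHERE n.type='project'\n"
        "RETURN n\n"
        f"SKIP {int(offset)}\n"
        "LIMIT 1"
    )


count = 1000  # result of the first (count) query
print(random_row_query(random_offset(count)))
```

Note that the count can change between the two queries, so the second one may come back empty if rows were deleted in the meantime.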

An alternative is to use statistical randomization:

START n=node(*)
WHERE n.type='project' AND ID(n)%20=0
RETURN n
LIMIT 1

Here 20 is a number you generated in your code. Of course this will never be fully random, and it also requires some knowledge of how the ID values are distributed, but for many cases this may be good enough.


5. Use the WITH clause when Cypher has expensive conditions
Take a look at this Cypher query:

START n=node(...), u=node(...)
MATCH p = shortestPath( n<-[*..5]-u) WHERE n.forks>20 AND length(p)>2
RETURN n, u, p

Here we will calculate the shortest path for all nodes. This is a CPU-intensive operation. How about separating concerns like this:

START n=node(...), u=node(...)
WHERE n.forks>20
WITH n as n, u as u
MATCH p = shortestPath( n<-[*..5]-u ) WHERE length(p)>2
RETURN n, u, p

Now the path is only calculated for the relevant nodes, which is much cheaper.


6. Arbitrary depth is evil
Always strive to limit the max depth of queries you perform. Each depth level increases the query complexity:

...
MATCH (n)<-[depends_on*0..4]-(x)
...
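
To get a feel for how fast this grows, here is a back-of-the-envelope sketch. It assumes an idealized graph where every node has exactly the same number of incoming depends_on relationships and no paths overlap, which real data will not match, so treat the numbers as an illustration only.

```python
def paths_up_to_depth(branching: int, max_depth: int) -> int:
    """Count paths of length 0..max_depth from one node in a uniform tree."""
    return sum(branching ** depth for depth in range(max_depth + 1))


for max_depth in (2, 4, 8):
    print(max_depth, paths_up_to_depth(10, max_depth))
# 2 111
# 4 11111
# 8 111111111
```

With 10 dependents per node, going from a max depth of 4 to 8 turns about eleven thousand candidate paths into more than a hundred million.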

7. Safe shutdown on Windows
When you run Neo4j on Windows in interactive mode (i.e. not as a service) do not close the console with the X button. Instead, always use CTRL+C and then wait a few seconds until the db is safely closed and the window disappears. If you did not close it safely by mistake, the next start will be slower (it can take a few minutes or more) since neo will do recovery. In that case the log (see #8) will show this message:

INFO: Non clean shutdown detected on log [C:\Program Files (x86)\neo4j-community-1.8.M03\data\graph.db\index\lucene.log.1]. Recovery started ...

8. The log is your best friend
When crap hits the fan, always turn to /data/log. Especially if neo does not start, you may find out that you have misconfigured some setting or that recovery has started (see #7).

9. Prevent cypher injection
Take a look at this code:

"START n=node(*) WHERE n='"+search+"' RETURN n"

If "search" comes from an interactive user then you can imagine what kind of injections are possible. The correct way is to use Cypher parameters, which any driver should expose an API for. If you use the awesome node-neo4j API by aseemk you could do it like this:

qry = "START n=node(*) WHERE n={search} RETURN n"
db.query qry, {search: "async"}
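
To see why this matters, here is a small driver-independent sketch that only builds the strings and sends nothing to a server. The function names and the hostile input are made up for illustration.

```python
def unsafe_query(search: str) -> str:
    """Concatenation: the user's text becomes part of the query itself."""
    return "START n=node(*) WHERE n='" + search + "' RETURN n"


SAFE_QUERY = "START n=node(*) WHERE n={search} RETURN n"


def safe_query(search: str):
    """Parameters: the query text is constant; the value travels separately."""
    return SAFE_QUERY, {"search": search}


hostile = "x' OR '1'='1"
print(unsafe_query(hostile))
# START n=node(*) WHERE n='x' OR '1'='1' RETURN n
print(safe_query(hostile))
```

In the concatenated version the input has rewritten the WHERE condition. In the parameterized version the query text never changes, whatever the user types.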

10. Where to get help
The Neo4j Google group and the community GitHub project are both very friendly and responsive.


5 comments:

Michael Hunger said...

Yaron,

thanks a lot for the blog post.

Some comments:

- currently internal IDs are not stable for deleted nodes and rels as they can be reused; that might change in the future with different storage implementations
- can you add a code example for parameter usage?

your statement:

START n=node(...), u=node(...)
WHERE n.forks>20 AND length(p)>2
WITH n as n, u as u
MATCH p = shortestPath( n<-[*..5]-u ) WHERE length(p)>2
RETURN n, u, p

is not correct, the first "AND length(p)>2" is too much, and you can also limit your path easily with the qualifier on the path expression: n<-[*2..5]-u

Usually cypher is clever enough to filter out nodes before the patterns are matched (like the n.forks example).

- RETURN count(n) there count(*) might be faster

Can you add some measures at the query where you talk about execution time? That would be great!

Yaron Naveh (MVP) said...

Hi Michael

Thanks for the response, I've updated the post to reference it.

1. Thanks for the clarification about IDs.

2. I've added an example for parameters.

3. The first length(p)>2 is definitely a mistake, only the second one should be there. The query in the post is a simplification of my real query, but possibly it wasn't a good one. My real query is below. Looking at it now, I may have had better ways to approach it since I do not filter by path. Anyway, I think the pattern I showed is important even if the example is not perfect. This is my real query, which runs faster with the WITH clause:

START n=node:node_auto_index(name='#{prj_name}')
MATCH (n)<-[depends_on*0..4]-(x)<-[:watches]-(u)
WHERE HAS(x.name) AND x.name<>'hoarders'
AND u.login =~ /(?i).*?#{login}.*?/
WITH distinct u as u, n as n
MATCH p = shortestPath( n<-[*..5]-u )
RETURN u as user, p as path, length(p) as len
ORDER BY LENGTH(u.gravatar_id) DESC, u.login_lower

4. Thanks for the count(n) tip, I've updated the post.

5. Do you mean the first item about the execution guard? After I ran some long queries by mistake, the server was very slow even 20 minutes later, until I restarted it. Also, when I opened the data query tab in the management UI, the "running" icon kept appearing and disappearing (this was consistent after a browser restart). I do not have exact measurements. When I asked about it in the group I was referred to this post, which may give you more information:

https://groups.google.com/forum/#!topic/neo4j/5ec8FThLTeo/discussion

Michael Hunger said...

Yaron,

you should use parameters in your longer query too :)

Now I can see why it is faster, you do two variable length matches which lead to exponential explosion in the same query, also your with reduces the users by the distinct (btw. you don't have to do "n as n").

re 5. I meant in general, show some numbers or at least how much faster the queries became with the changes.

Cypher has been filtering the nodes before the matches as far as it could for a while now. So that optimization is not so much about moving the where but about separating the two var-length matches and also the "distinct u".

Looking forward to your blog post on the GitMoon concepts and how you implemented those. Would love to reblog it on blog.neo4j.org then.

Yaron Naveh (MVP) said...

Michael

The reason I do not use params in parts of the queries is this bug:

https://github.com/neo4j/community/issues/663

Instead of mixing params and embedded values I used the latter solely in some queries. I am doing very strict sanitization on input strings though.

Srisys said...

Hi, Thanks for sharing such a great information on codes & errors...

Highly appreciate your efforts!