Hibernate: Understand FlushMode.NEVER

Posted by tfenne under Java


I've complained before about how the Hibernate documentation doesn't provide much help for those of us dealing with large data sets. The established point of view seems to be that "large amounts of data == batch processing" and "Hibernate/Java is not the best place for batch processing." Seeing as the system I'm currently working on (a DNA resequencing LIMS) has a complex data model and very large amounts of OLTP data, I'm at odds with the first piece of accepted wisdom. We deal with large (thousands of objects), complex object webs in memory to implement use cases driven from a user interface in human time (i.e. non-batch).

One of the more useful pieces of advice I've seen meted out on the hibernate forums was a trite statement from Gavin to some poor user to "understand FlushMode.NEVER". Well, I've recently had occasion to understand it, and I thought I'd share. To use FlushMode.NEVER you might write something like:

session.setFlushMode(FlushMode.NEVER);

This tells the Hibernate session that it is not to flush any state changes to the database at all, ever, unless told to do so by the application programmer calling session.flush() directly. That's fairly logical, and is useful in its own right, but why does it have a big impact on performance in some scenarios?

To explain, let's take a look at a fairly typical use case of mine (if you don't understand the science, don't worry). It involves automatically picking the "best" available PCR primers for the genes/exons being resequenced in an experiment, and goes something like this:

  1. Find all the exons/targets in the experiment (ranges up to 5000)
  2. Find all the PCR primers that have already been ordered by the experiment, and worked
  3. Find all the PCR primers that are in queue to be ordered by the experiment
  4. Net out the primers against the targets to find what isn't covered
  5. For each un-covered area query for all possible PCR primers in the system
  6. Examine each matching primer and pick the best one for each target
  7. Save the picked PCR primers to the experiment

Because our domain model is fairly rich, by step 4 we've probably got on the order of 20,000-30,000 objects in the session! And then each query (it's actually two queries, but that's not so important) in steps 5/6 brings in anywhere from 0 to 20 additional objects. What's interesting, though, is that while we're pulling in a lot of objects, none of the objects we're loading are getting modified. Only new objects are getting created, and they are saved at the end.

Back to FlushMode.NEVER then. You might assume that since we're not persisting or changing anything until the last step, changing the flush mode would have little consequence. You'd be wrong, of course. In actuality, setting the flush mode to NEVER at the start of this flow and then flushing manually at the end caused the run time to drop by over half!

The reason for this is the way Hibernate implements dirty checking. Every time you load an object into memory (and don't evict it), the session keeps track of it in case it changes. (Of course, you could evict it, but then any lazy-loaded relationships won't be able to load.) So, in the default flush mode (FlushMode.AUTO), any time you perform a query the session iterates over all the objects it is tracking, checks each one for dirtiness, and flushes any dirty objects to the database. It does this to ensure that all state changes that might affect the query are present in the database before issuing the query to it. That's fine when you have only a few objects in the session, but when you have thousands of objects and are performing thousands of queries, it becomes a real drain on performance.
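To put rough numbers on that, here is a toy model of the behaviour described above. It is not Hibernate's implementation: ToySession, Entry, Mode and the comparisons counter are all made up for illustration, and the only assumption carried over from the real thing is that an automatic flush scans every tracked object before each query.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a session; hypothetical, not Hibernate's actual code. */
class ToySession {
    enum Mode { AUTO, NEVER }

    /** A tracked object plus the snapshot taken when it was loaded. */
    static final class Entry {
        String current;
        String snapshot;
        Entry(String value) { current = value; snapshot = value; }
    }

    private final List<Entry> tracked = new ArrayList<>();
    private Mode mode = Mode.AUTO;
    long comparisons = 0; // total number of dirty checks performed

    void setMode(Mode mode) { this.mode = mode; }

    Entry load(String value) {
        Entry e = new Entry(value);
        tracked.add(e);
        return e;
    }

    /** Compares every tracked object to its snapshot; returns how many were "written". */
    int flush() {
        int written = 0;
        for (Entry e : tracked) {
            comparisons++;
            if (!e.current.equals(e.snapshot)) {
                e.snapshot = e.current; // pretend an UPDATE was issued
                written++;
            }
        }
        return written;
    }

    /** In AUTO mode every query pays for a full scan first. */
    void query() {
        if (mode == Mode.AUTO) {
            flush();
        }
        // the real query would run here
    }

    /** Loads `objects` entries, runs `queries` queries, flushes once at the end. */
    static long run(Mode mode, int objects, int queries) {
        ToySession s = new ToySession();
        s.setMode(mode);
        for (int i = 0; i < objects; i++) s.load("obj" + i);
        for (int q = 0; q < queries; q++) s.query();
        s.flush(); // the manual flush at the end of the use case
        return s.comparisons;
    }

    public static void main(String[] args) {
        System.out.println("AUTO:  " + run(Mode.AUTO, 20_000, 1_000));  // 20020000
        System.out.println("NEVER: " + run(Mode.NEVER, 20_000, 1_000)); // 20000
    }
}
```

With 20,000 objects and 1,000 queries, the AUTO run does about a thousand times as many comparisons as the NEVER run, even though nothing was ever dirty. The real saving depends on how expensive each check is compared to the query itself.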

What I have yet to understand is exactly why Hibernate does it this way. Perhaps it's the developers' belief that improving performance for large data sets isn't worth the additional complexity? What am I talking about? Well, a lot (the vast majority) of the objects that get loaded into memory are proxied by Hibernate anyway. These don't need to be dirty checked when queries are made, because they could actively be flagged as dirty when changes are made through the proxy, and added directly to a dirty list. It would seem to me that if Hibernate always returned proxied instances, even on non-lazy loads, this problem could practically disappear. There's still the case when someone persists a new object and then tries to modify it without replacing their reference with a newly proxied reference, but even if those instances were tracked the old way, it would hardly affect performance in most use cases like this.
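Here is a rough sketch of the "flag it as dirty through the proxy" idea. It uses a plain JDK dynamic proxy as a stand-in, which is not how Hibernate builds its proxies, and the Primer interface, PrimerImpl and DirtyTrackingSketch are hypothetical names invented for the example. The "starts with set" test is a naive assumption about what counts as a modification.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.LinkedHashSet;
import java.util.Set;

class DirtyTrackingSketch {
    /** Hypothetical entity interface; JDK proxies can only implement interfaces. */
    interface Primer {
        String getSequence();
        void setSequence(String sequence);
    }

    static final class PrimerImpl implements Primer {
        private String sequence;
        PrimerImpl(String sequence) { this.sequence = sequence; }
        public String getSequence() { return sequence; }
        public void setSequence(String sequence) { this.sequence = sequence; }
    }

    /** Objects that reported a change; a flush would only need to look here. */
    final Set<Object> dirty = new LinkedHashSet<>();

    /** Wraps the target so that any setter call adds it to the dirty set. */
    Primer track(Primer target) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object result = method.invoke(target, args);
            if (method.getName().startsWith("set")) {
                dirty.add(target); // naive rule: every setter call counts as a change
            }
            return result;
        };
        return (Primer) Proxy.newProxyInstance(
                Primer.class.getClassLoader(),
                new Class<?>[] { Primer.class },
                handler);
    }

    public static void main(String[] args) {
        DirtyTrackingSketch session = new DirtyTrackingSketch();
        Primer a = session.track(new PrimerImpl("ACGT"));
        Primer b = session.track(new PrimerImpl("TTGA"));
        a.getSequence();       // a read leaves the dirty set alone
        b.setSequence("TTGC"); // a write registers b's target as dirty
        System.out.println(session.dirty.size()); // prints 1
    }
}
```

In this scheme the cost of finding dirty objects before a query scales with the number of objects actually changed, not with the number loaded. The catch is the one mentioned above: any reference that bypasses the proxy is invisible to it.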

Anyway, the moral of the story is this: in use cases with lots of reading/querying and lots of objects in the session, you should use FlushMode.NEVER as long as you're not modifying data that will affect your queries. Example pseudo-code is below:

FlushMode previous = session.getFlushMode();
session.flush(); // who knows what's been done until now
session.setFlushMode(FlushMode.NEVER);
// Do some querying
// Do some more querying
// Really load up that session
// Execute a few more queries
// Write back to some tables
session.flush();
session.setFlushMode(previous);