Secondary Indexes in HBase

Fundamental problems in solution development can be deceiving – in particular when considered from a conceptual distance (i.e. being somewhat clueless). At one time they look too hard to address as one more of the many side topics one has to keep under control, another time they turn into something completely manageable – only just to hit you hard once you turn around again.

One such topic that kept me busy lately is secondary indexing for HBase stored data.

As such, HBase does not have any built-in support for secondary indexes. All data is ordered by row keys and that’s it (in fact, HBase does not store rows but rather versioned columns).

Why is it that HBase does not have secondary indexes while products in the same space, such as Cassandra and Hypertable do? I would put it this way:

Due to the distributed nature of Bigtable style database systems, there is no good single solution to secondary indexing – but one rather has to chose between trade-offs to find a suitable approach for a given use-case

(Start you exploration here).

That would not necessarily need to mean that there could not be a variety of implementations available out of the box. Unless there is some stronger type system support and more agreement and experience with the various approaches, I guess, this will however still take some time.

This post is on three different approaches I learned about lately and what I think are their main qualities (with some interesting links).

Global Index Table

This is the most obvious (and still useful) approach. In its simplest version: Given some row r with row key r.key to index by some field value r.val, store a mapping

${r.val} / ${r.key}

as row keys in another table. Finding all rows with r.val==x then simply means to scan for the row keys following “x/” . As invalid index entries can be readily identified by checking the row for the expected field value keeping a consistent index is really easy by making sure that index entries are written before row updates are written and by cleaning out stale entries once in a while. So, the implementation is really not that hard and and it is effective (see also Global Index Table references below). Due to its simplicity and robustness, it would be highly desirable to have some utility implementation generally available.

Furthermore it is space-efficient, indexes can easily be regenerated if things go bad, you need no further software but HBase, and you can be really smart about the value to index as its all application based.

There are significant downsides as well: Getting a single row via the index is two roundtrips and, worse, scanning rows via the index implies to – in the worst case – to have randomly distributed “point gets”.

scnd_index_gt

 

The latter is obviously the most significant downside and if you need to efficiently scan rows via an index this approach is not for you.

Region-level Index

An approach that is more focussed on adding efficiently to write secondary indexes to the database implementation is to add region server or region level indexes. That is, integrated with HBase’s Write-Ahead-Log (WAL), indexes are written directly and atomically for data within the region server local regions to the region server. So, every region server holds indexes for its data. This approach is implemented by HIndex and IHBase.

This is good for the write performance. Reading data via the index does however require to contact each region in question, as there is no way of knowing where not to look for index content (e.g. unlike the global index table that uses regular HBase row key ordering).

scnd_index_rli

Effectively that means that read-throughput does not scale up with the number of region servers, but due to parallelization it does not degrade with data size either. So, in a way, like a classic RDBMS it does not scale vertically in throughput but unlike a classic RDBMS it does scale vertically in space.

Covering Index

Covering indexes are essentially materialized views without a name. That is, instead of fishing for the data via some key indirection, the actual data is copied into the index table which employs row keys corresponding to the indexed data rather than the original row key.

So, index-based queries don’t need to go anywhere but just into the right table where all data is already present. Consequently read performance is great.

scnd_index_cover

Updating the index however is much more complicated now. All copies need to be updated when changing data – unlike the global table approach there is no simple verification possible while still preserving the performance attributes.

While this approach does perfectly preserve the scalability attributes of HBase, indexes may still come at a steep price: Every new index potentially duplicates a significant portion of the original data.

Secondary indexing in Phoenix, a SQL-implementation for HBase, is based on this approach. And indeed query-performance is one of the strong selling points while there is still great effort put in making sure index updates are essentially atomic (and WAL-integrated) and performing well.

Summary

Not being really deep into the implementation details of the more complicated approaches, I must admit that I find Region-Level-Indexing the least attractive option. It looks like something that is asking for trouble in large setups – which is after all exactly what you expect to happen, once you come looking for HBase.

Covering indexes are really interesting – for analytical applications: If you need fast query performance over a large data this looks good. In analytical applications you typically know about the dimensions that need to be present in result sets.

For OLTP-style applications scanning via indexes is often a less prominent access method compared to single but complete row retrieval for display and manipulation by a variety of conditions. That is, global index tables may still be the more space-efficient and flexible means to that end, while still being sufficiently read efficient.

Ideally it should be (and quite possibly is already) possible to combine both approaches.

————-

References

Indexing on Hbase

  1. Secondary Indexes and Alternate Query Paths, HBase documentation,
  2. Musings on Secondary Indexing in HBase, Jesse Yates
  3. HBASE-2038, HBase issue tracker discussion

Global index tables

  1. Consistent Enough Secondary Indexes, Jesse Yates
  2. Musings on Secondary Indexes, Lars Hofhansl

Region-Level indexing

  1. HIndex at Github
  2. HIndex at Intel
  3. IHBase

Covering Indexes

  1. Phoenix
  2. Phoenix Secondary Indexing

Other indexing tools for HBase

  1. Culvert
  2. Lily

Other Big Table style database products with secondary indexing

  1. Hypertable
  2. Cassandra (and here)