Lucene, SOLR, Hadoop and Pig

I haven’t posted anything in a VERY LONG time. Too busy coding, I guess. It’s a shame, because in the past year (or so) a lot has happened, and a lot more is about to happen. Good times…

“May you live in interesting times” – ancient Asian curse.

First, the project that I was busy working on for nearly a year was put on “suspended animation” (more on this later, maybe).

Next, the programmer who worked on the hub of our data distribution system left. Since I had “time on my hands” (as if!!) I inherited this project.

This is where things have gotten interesting. Most people find “interesting” exciting. Unfortunately, “interesting” along with “deadlines” is not always a good mix.

The company I’m working for is in the information business. If you need to know something about somebody, come to our site and look them up. There are trillions of records regarding millions of persons and everything they’ve ever done in the public eye. When people come to our site, they don’t always know all the particulars about a person of interest, so we have to allow them to search for that person. Oh, don’t forget to make it a quick search so that the user has a pleasant experience on our site!

Imagine having to search for birth records for someone named “Joe something-or-other born somewhere in New York city, or was it New York County….”

Yeah. Like that. All day long.

To handle search requests such as that (well, maybe not that bad) we use a search engine named SOLR, which is based on a search technology named Lucene. Both are products in the Apache tool chest.

Prior to my coming to work at this company, I’d run into this technology exactly NEVER. So it was quite an eye opener. Like a lot of .NET developers, I’ve spent my fair share of time writing stored procedures and developing database systems designed to sell you the shoes that you want, or the shirt your father wants, etc.

But this is nothing like that. At all.

In this line of work, we spend a lot of time contemplating things like how many hard-drive platters will have to spin to retrieve a series of records needed to fulfill a search request, and how to minimize the impact on the end user. Much bigger data than in the past.

The basic setup that we use to retrieve information requested by an end user is roughly like this:

  1. User enters their search criteria into front-end app
  2. Front-End app sends its data request to its middle-tier layer where the search parameters are recorded, the request is (possibly) re-packaged and sent to my search layer
  3. The search layer re-packages the search parameters into a SOLR compatible expression and then requests pertinent data from SOLR.
  4. Pertinent records are returned from SOLR and passed back to my search layer application, where the results are updated, flipped, folded, groomed, filtered and eventually sent back to the calling website.
  5. The website then slaps any final adjustments to the data and then displays it to the customer.

All of this:  0.5 seconds or less.
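To make step 3 a little more concrete, here is a rough sketch of what “re-packaging the search parameters into a SOLR compatible expression” could look like. This is not our actual search layer. The host, the core name (“people”) and the field names are made up for illustration; the only real SOLR pieces are the /select handler and the q, rows and wt parameters.

```python
from urllib.parse import urlencode

def build_solr_query(criteria, rows=10):
    """Turn a dict of field -> value search criteria into SOLR /select parameters."""
    clauses = []
    for field, value in sorted(criteria.items()):
        # Escape backslashes and quotes, then quote the value so that
        # multi-word input ("New York") is treated as a phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"')
    # No criteria at all: fall back to SOLR's match-everything query.
    q = " AND ".join(clauses) if clauses else "*:*"
    return urlencode({"q": q, "rows": rows, "wt": "json"})

# Hypothetical host, core and field names.
base = "http://localhost:8983/solr/people/select"
params = build_solr_query({"first_name": "Joe", "city": "New York"})
print(base + "?" + params)
```

A real search layer would also have to deal with fuzzy input (the “Joe something-or-other” problem), which this sketch ignores completely.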

We have not made any changes to the SOLR engine other than to update some “config” files used at startup. SOLR is pretty much running in an “out of the box” configuration. My search application layer is another story.

The original developer of this system is a mad-man when it comes to code. He sleeps it, dreams it, practically wears it on his clothes. Me: not so much.

The hardest part of this whole transition has been to learn where a particular functionality exists in the mountain of code that I’ve inherited. Although I have a much better handle on where code is located, knowing what it does is another story.

“The code is self-documenting”

No. It’s not. That famous idiom was provided to me when I tried to confirm if there was some sort of “roadmap” to the layout of the project. I’d heard that idiom several times over the years, and every time I’d heard that, it was plain to see that no, the code was most certainly NOT self-documenting.

I strive to add plenty of comments to my code. I too fall out of the habit of commenting as I go, and invariably, I end up adding in comments at a later time, usually when I can’t remember why in the dickens I did something so foolish as whatever I’m currently working on. So, please, be a good programmer and throw me (and your fellow programmers) a bone and throw some comments in your code. No matter how obvious you think something is, it probably isn’t.

Back to SOLR……

SOLR is a search engine that allows you to search for a “phrase” in the “documents” that it has stored and loaded in memory. For example, say you have a data table that contains a product name, manufacturer name, size, color, SKU, and a text description. You would construct a “document” from this table, one document per table row (we load our indexes using XML format), and submit the document into SOLR for indexing.
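As an illustration, here is a small sketch of turning one table row into a document in SOLR’s XML update format (an add element wrapping a doc, with one field element per column). The column names and values are invented for the example, and in real life the field names have to match whatever is defined in your SOLR schema.

```python
import xml.etree.ElementTree as ET

def row_to_solr_xml(row):
    """Build a SOLR XML <add><doc>...</doc></add> message from one table row."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in row.items():
        # One <field name="..."> element per column in the row.
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# Hypothetical product row; the column names are assumptions.
row = {
    "sku": "SH-1001",
    "product_name": "Oxford shirt",
    "manufacturer": "Acme Apparel",
    "size": "L",
    "color": "blue",
    "description": "Classic dress style for the office",
}
xml_doc = row_to_solr_xml(row)
print(xml_doc)
```

The resulting XML is what you would post to SOLR’s update handler, followed by a commit so the new document becomes searchable.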

Once your documents are submitted into the index, you can search for information contained in those documents.

SOLR was described to me as an inverted index. An English major would have used the term “glossary” and it would have been more accurate. But what do they know?

So, using our products table from above, someone could come to your site and enter “men’s blue dress shirt” and SOLR would comb through the records it has indexed and return all the documents that have “men”, “blue”, “dress” or “shirt” in any one of the fields we indexed. For example, SOLR may find “men” in the “gender” field, “blue” in the “color” field, “shirt” in the “product name” field and “dress” in the “product description” field.

Of course this is a VERY high-level description of what SOLR does. If you’re curious, you should check it out at the Apache SOLR site.
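If “inverted index” still sounds like jargon, here is a toy version of the idea: a lookup from each word to the set of documents that contain it. It is only a sketch. The products are invented, the tokenizer is crude, and counting matching words is a simple stand-in for SOLR’s real relevance scoring, which is far more sophisticated.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase, drop possessive 's (straight or curly apostrophe), split into words."""
    text = re.sub(r"['’]s\b", "", text.lower())
    return re.findall(r"[a-z0-9]+", text)

def build_index(docs):
    """Map each term to the set of document ids that contain it in any field."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for value in doc.values():
            for term in tokenize(value):
                index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents matching ANY query term, best matches first."""
    hits = Counter()
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))

# Hypothetical products, keyed by document id.
docs = {
    1: {"gender": "men", "color": "blue", "product_name": "Oxford shirt",
        "description": "Classic dress style"},
    2: {"gender": "women", "color": "red", "product_name": "Summer dress",
        "description": "Light cotton"},
    3: {"gender": "men", "color": "black", "product_name": "Leather belt",
        "description": "Formal"},
}
index = build_index(docs)
print(search(index, "men's blue dress shirt"))  # [1, 2, 3]
```

Document 1 comes first because it matches all four words. Documents 2 and 3 each match only one (“dress” and “men”), but they still come back, which is exactly the “or” behavior described above.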

Hadoop – Had-a-what??

By the time my predecessor had left, we had two versions of SOLR in place: SOLR 3.x for “people” based searches and a new SOLR 4.7 installation for “public record” based searches. But the consensus was that the improvements in SOLR 4 were significant enough that we want to move our “people” searches on to SOLR 4 based hardware.

The problem is that we have a lot of data. How much? Imagine 1000 records of various types for every person that has been alive in the US since around 1930. See what I mean? LOTS OF DATA. All of it indexed into our SOLR 3.x farm. Now, we want to create new indexes for these people in our SOLR 4 farm. The problem is that the new “data format” does not appear to be backwards-compatible with our SOLR 3 based indexes. So we’re going to have to build new indexes from scratch.

Basically, we would read back every document stored in the SOLR 3 system and index it into our SOLR 4 system. The problem is that it will take about six weeks to accomplish (that’s how long it takes to “update the records” in the current system; details if you ask nicely).

My predecessor had recommended harnessing the power of “the cloud” to speed up some of this work. In the cloud, we could have many (thousands?) machines crunching away on the re-indexing of our data and have the job done in a matter of hours instead of weeks (we hope). To do this, we would use Hadoop.

Hadoop is a solution for distributed computing. It allows you to take one large job and dole it out over X number of small computers, each doing its part of the large job. Then the many small outputs are re-assembled into one large output that will become (in our case) our new SOLR 4 formatted index.
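Here is the split-it-up-and-glue-it-back-together idea in miniature. To be clear, this is not Hadoop and it is not how a SOLR index actually gets built. It is a local simulation in which a small thread pool stands in for the cluster of machines, and the “index” is just a word-to-document lookup. All the data and function names are made up.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def index_chunk(chunk):
    """'Map' step: one worker builds a partial index for its slice of documents."""
    partial = defaultdict(set)
    for doc_id, text in chunk:
        for term in text.lower().split():
            partial[term].add(doc_id)
    return partial

def merge(partials):
    """'Reduce' step: combine the many small outputs into one large output."""
    merged = defaultdict(set)
    for partial in partials:
        for term, ids in partial.items():
            merged[term] |= ids
    return merged

def split(items, n):
    """Deal the items out into n roughly equal chunks."""
    return [items[i::n] for i in range(n)]

# Hypothetical documents: (id, text) pairs.
docs = [
    (1, "blue dress shirt"),
    (2, "red summer dress"),
    (3, "blue jeans"),
    (4, "white shirt"),
]

# Two threads stand in for two machines in the cluster.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(index_chunk, split(docs, 2)))

index = merge(partials)
print(sorted(index["blue"]))   # [1, 3]
print(sorted(index["dress"]))  # [1, 2]
```

The appeal is that index_chunk never needs to know about any other chunk, so in principle you can throw as many workers at it as you have chunks.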

I’ve never used Hadoop, and I’ve never used the cloud like this. So it’s going to be an eye-opening (and most likely very painful) experience in the next few weeks / months.

Distributed processes that Hadoop can work with come in a variety of flavors. Our database administrator has recommended using Apache Pig since we are working on a data project. Pig is a platform for analyzing large datasets: you write scripts in its own language (Pig Latin), and they get turned into jobs that run in parallel on Hadoop. See where we’re going with this? (I guess I should invest in Apache???).

Anyway, I know the “big pieces” with which I will become intimately familiar in the next few weeks / months. But like a mountain on the horizon, it becomes much more imposing the closer you get.

I’ll try to keep posting stuff on here as I come across (and hopefully conquer) these challenges. Wish me luck!
