Shalin Says...

DataImportHandler is an Apache Solr module that provides a configuration-driven way to import data from databases, XML and other sources into Solr, both as “full builds” and as incremental delta imports.

Many new features have been added since it first shipped in Solr 1.3.0. Here’s a quick look at the major ones:

Error Handling & Rollback

The ability to control behavior on errors was an oft-requested feature in DataImportHandler. With Solr 1.4, DataImportHandler provides configurable error handling options for each entity. You can specify one of the following as an attribute on the “entity” tag:

  1. onError="abort" - Aborts the import process
  2. onError="skip" - Skips the current document
  3. onError="continue" - Continues as if the error never occurred
All errors are still logged regardless of the selected option. When an import aborts, either due to an error or a user command, all changes to the index since the last commit are rolled back.

Event Listeners

An API is exposed to write listeners for import start and end. A new interface called EventListener has been introduced which has a single method:

public void onEvent(Context ctx);

For example, the listener can be specified as:
<document onImportStart="com.foo.StartListener" onImportEnd="com.foo.EndListener">
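
A listener class such as the com.foo.StartListener named in that attribute might look roughly like the sketch below. To keep it compilable on its own, it declares minimal stand-ins for the Context and EventListener types; in a real deployment these come from DataImportHandler itself. The describe() accessor and the entries() helper are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for DataImportHandler's Context type (assumption: the real one
// lives in Solr; describe() is a hypothetical accessor for this sketch only).
interface Context {
    String describe();
}

// Stand-in for the single-method listener interface described in the post.
interface EventListener {
    void onEvent(Context ctx);
}

// A listener that simply records each import start it is notified about.
final class StartListener implements EventListener {
    private final List<String> log = new ArrayList<String>();

    @Override
    public void onEvent(Context ctx) {
        log.add("import started: " + ctx.describe());
    }

    // Hypothetical helper so callers can inspect what was recorded.
    List<String> entries() {
        return log;
    }
}
```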

Push data to Solr through DataImportHandler

In Solr 1.3, DataImportHandler was pull-based only. If you wanted to push data to Solr, e.g. through an HTTP POST request, you had no choice but to convert it to Solr’s update XML format or CSV format. That meant that all the DataImportHandler goodness was not available. With Solr 1.4, a new DataSource named ContentStreamDataSource allows one to push data to Solr through a regular POST request.

Suppose one wants to push the following XML to Solr and use DataImportHandler to parse and index:

<root>
  <b>
    <id>1</id>
    <c>Hello C1</c>
  </b>
  <b>
    <id>2</id>
    <c>Hello C2</c>
  </b>
</root>

We can use ContentStreamDataSource to read the XML pushed to Solr through HTTP POST:

<dataConfig>
  <dataSource type="ContentStreamDataSource" name="c"/>
  <document>
    <entity name="b" dataSource="c" processor="XPathEntityProcessor"
            forEach="/root/b">
      <field column="desc" xpath="/root/b/c"/>
      <field column="id" xpath="/root/b/id"/>
    </entity>
  </document>
</dataConfig>
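
On the client side, the push is an ordinary HTTP POST whose body is the XML document. Here is a minimal sketch using only the JDK. The handler path /dataimport, the clean=false parameter (so that the push does not wipe existing documents) and the class and method names are assumptions made for illustration; adjust them to match your own solrconfig.xml.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class DihPush {

    // Builds the import URL. Assumption: the handler is registered at
    // /dataimport and we do not want the index cleaned before importing.
    static String importUrl(String solrBase) {
        return solrBase + "/dataimport?command=full-import&clean=false";
    }

    // POSTs the XML as the request body and returns the HTTP status code.
    static int post(String url, String xml) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }
}
```

A call would then look like DihPush.post(DihPush.importUrl("http://localhost:8983/solr"), xml), where the base URL is again only an example.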

More Power to Transformers

New flag variables have been added which can be emitted by custom Transformers to skip rows, delete documents or stop further transforms.

New DataSources

  • FieldReaderDataSource - Reads data from an entity’s field. This can be used, for example, to read XML stored in databases.
  • ContentStreamDataSource - Accepts HTTP POST data in a content stream (described above)

New EntityProcessors

  • PlainTextEntityProcessor - Reads from any DataSource and outputs a String
  • MailEntityProcessor (experimental) - Indexes mail from POP/IMAP sources into a Solr index. Since it requires extra dependencies, it is available as a separate package called “solr-dataimporthandler-extras”.
  • LineEntityProcessor - Streams lines of text from a given file to be indexed directly or for processing with transformers and child entities.

New Transformers

  • HTMLStripTransformer - Strips HTML tags from input text using Solr’s HTMLStripCharFilter
  • ClobTransformer - Reads strings from Clob types in databases.
  • LogTransformer - Logs data in a given template format. Very useful for debugging.

Apart from the above new features, there have been numerous bug fixes, optimizations and refactorings. In particular:
  • Optimized defaults for database imports
  • Delta imports consume less memory
  • A ‘deltaImportQuery’ attribute has been introduced which is used for delta imports along with ‘deltaQuery’ instead of DataImportHandler manipulating the SQL itself (which was error-prone for complex queries). Using only ‘deltaQuery’ without a ‘deltaImportQuery’ is deprecated and will be removed in future releases.
  • The ‘where’ attribute has been deprecated in favor of ‘cacheKey’ and ‘cacheLookup’ attributes making CachedSqlEntityProcessor easier to understand and use.
  • Variables placed in DataSource, EntityProcessor and Transformer attributes are now resolved, making very dynamic configurations possible.
  • JdbcDataSource can look up a javax.sql.DataSource using JNDI
  • Revamped EntityProcessor APIs for ease in creating custom EntityProcessors

There are many more changes; see the changelog for the complete list. There’s a new DIHQuickStart wiki page which can help you get started faster by providing cheat sheet solutions. Frequently asked questions along with their answers are recorded in the new DataImportHandlerFaq wiki page.

A big THANKS to all the contributors and users who have helped us by giving patches, suggestions and bug reports!

Future Roadmap

Once Solr 1.4 is released, there is a slew of features targeted for Solr 1.5, including:
  • Multi-threaded indexing
  • Integration with Solr Cell to import binary and/or structured documents such as Office, Word, PDF and other proprietary formats
  • DataImportHandler as an API which can be used for creating Lucene indexes (independent of Solr) and as a companion to Solrj (for true push support). It will also be possible to extend it for other document oriented, de-normalized data stores such as CouchDB.
  • Support for reading Gzipped files
  • Support for scheduling imports
  • Support for Callable statements (stored procedures)
If you have any feature requests or contributions in mind, do let us know on the solr-user mailing list.


Apache Lucene 2.9 has been released. Apache Lucene is a high performance, full-featured text search engine library written entirely in Java.

From the official announcement email:

Lucene 2.9 comes with a bevy of new features, including:

  • Per segment searching and caching (can lead to much faster reopen among other things)
  • Near real-time search capabilities added to IndexWriter
  • New Query types
  • Smarter, more scalable multi-term queries (wildcard, range, etc)
  • A freshly optimized Collector/Scorer API
  • Improved Unicode support and the addition of Collation contrib
  • A new Attribute based TokenStream API
  • A new QueryParser framework in contrib with a core QueryParser replacement impl included.
  • Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required.
  • New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)
  • New fast-vector-highlighter for large documents
  • Lucene now includes high-performance handling of numeric fields. Such fields are indexed with a trie structure, enabling simple to use and much faster numeric range searching without having to externally pre-process numeric values into textual values.
  • And many, many more features, bug fixes, optimizations, and various improvements.
Look at the release announcement for more details.

Congratulations to the Lucene team! Great work as always.

This is also the last minor release that supports the Java 1.4 platform. The next release will be 3.0, in which deprecated APIs will be removed and Lucene will officially move to Java 5.0 as the minimum requirement.

Solr 1.4 is not far behind and we hope to release it within two weeks.

I saw an advertisement today for taking a survey on Tomcat to help define its future directions. I don’t usually click on ads but this one seemed interesting so I did. It was a short one (thanks guys!) so I didn’t mind completing it. What I did not like so much was the focus on questions about how Tomcat can compete with “enterprise” application servers. What is “enterprise” anyway? If it’s about performance then Tomcat is enterprise ready. It doesn’t really matter if other commercial servers can do a few more requests per second on artificial tests. Tomcat is free (as in freedom)! That’s a huge advantage.

Tomcat is the most widely used application deployment platform at AOL. As with most other web companies, we don’t really need all the cruft, cost and complexity which “enterprise” application servers bring in. It is a tried and tested platform with good performance characteristics, easy administration and monitoring. It scales well enough. We run thousands of Tomcats serving hundreds of millions of pages. Since it is free, we don’t need to scale by buying expensive servers; we can just scale out by adding more low-cost servers (which, by the way, also adds redundancy).

I don’t think Tomcat’s goal should be to add features (read complexity) to compete with the so-called enterprise application servers. It should continue to focus on being a performant Servlet/JSP container with easy development, administration and monitoring support. What I’d like to see added to Tomcat is:

  • Easy ways to use Tomcat in an embedded fashion (like Jetty)
  • Improve Tomcat manager
  • Easy configuration for my webapp (Properties vs. JNDI/context)
I’m not a guy who adds features just for the heck of them. So I’ll give use-cases for each of the above requests. These use-cases come from my own recent work.

Easy ways to use Tomcat in an embedded fashion

I don’t see myself shipping a product with an embedded Tomcat but I’ve frequently needed to have an embedded container for unit/integration testing REST APIs. Sometimes, I’ve used Jetty, other times I’ve mocked stuff. All of my production deployments use Tomcat so it is only natural that I use Tomcat for integration testing. Solr uses Jetty for testing and for providing a standalone example which works great. I like the easily embeddable nature of Jetty. However, I also believe that part of the reason behind the popularity of Jetty was that Tomcat was not embeddable enough. It had a lot of strongly coupled extra features (and a lot of related code) which were not needed. Valves, realms, JNDI contexts, authentication and clusters are things which are generally not needed in the embeddable scenarios. Note that embedding Tomcat is possible but it is not documented that well and there is no easy way to find out all the dependency jars I’d need to do that. The last time I did this successfully with Maven, I had to track down the dependencies myself and add each one to my application to make it work. So easy for me means: publish latest jars to Maven, use the dependency structure that Maven provides, make it easy for me to remove the extra features I don’t need, focus on keeping the artifacts smaller and have good documentation on how to use the API.

Improve Tomcat Manager

A few months ago, I worked on a deployment application to push code updates to Tomcat servers across data centers. The use-case is simple. I want to update my application’s code without causing downtime. So I drain traffic away from individual servers, update the code, verify that it is in fact updated, and redirect traffic back to the server. I worked with the features provided by the Tomcat manager application. Not too many people actually use the Manager in the name of security but that’s a separate topic. I wanted to add some custom commands to the manager and I couldn’t because it was not designed to be extensible. In the end, I had to copy code from Tomcat’s sources and modify it to make it work. This is an area which could use some improvement. Coupled with good documentation on how to securely use the manager application, it has the potential to be used more widely. I want to use the Tomcat manager application from certain whitelisted IPs and only with SSL. Sounds simple but it was damn hard to get it working the way we wanted.

Easy configuration for my webapp

Configuration is a difficult issue. There are always so many right ways depending on who you ask. I just need to provide some key/value pairs to my application which change rarely but when they do, I’d really like them to be reloaded without bringing my application down. I’d really like to push those into the war but then I’d have different wars for different environments (dev/qa/prod) and that’d make some people very nervous (why?). I could use JNDI but that is much more complicated to manage than it needs to be for my simple use-case. Sysadmins don’t like XML, and that is a well-known fact. It’s easier for everybody to modify properties files vs an XML file for simple key/value pairs. I want to hot reload them, just like Tomcat hot-loads wars dropped into the webapps directory but I guess you can’t do that. So I wrote my own small, sweet library to read properties files from a certain location, checking every few minutes for changes to the file. If Tomcat itself had something similar, I’d just use that. I think it might be a very common use-case.
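
To make the idea concrete, here is a minimal sketch of such a reloader. It is not the library mentioned above; the class and method names are hypothetical, and it assumes that comparing the file’s last-modified timestamp is a good enough change check.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class ReloadingProperties {
    private final File file;
    private volatile Properties current = new Properties();
    private volatile long loadedStamp = -1L;

    ReloadingProperties(File file) throws IOException {
        this.file = file;
        reloadIfChanged(); // initial load
    }

    // Re-reads the file only when its last-modified time differs from the
    // one seen at the previous load. Returns true if a reload happened.
    synchronized boolean reloadIfChanged() throws IOException {
        long stamp = file.lastModified();
        if (stamp == loadedStamp) {
            return false;
        }
        Properties fresh = new Properties();
        try (InputStream in = new FileInputStream(file)) {
            fresh.load(in);
        }
        current = fresh;       // swap in the new snapshot atomically
        loadedStamp = stamp;
        return true;
    }

    String get(String key) {
        return current.getProperty(key);
    }
}
```

Calling reloadIfChanged() from a timer every few minutes, for instance through a ScheduledExecutorService, would give the polling behaviour described above. Readers always see a complete snapshot because the Properties object is replaced rather than mutated.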

On a related note, I’m very excited about tomcat-lite, Comet/Bayeux and the new servlet API (asynchronous servlets) coming into Tomcat. I also wish for an easy way to write non-HTTP applications on top of Tomcat’s NIO stack (again de-coupling may help) but that may be asking too much. I know it’s all a do-ocracy and I’m not doing my part. Someday I hope to contribute code rather than just ideas and complaints. For now, this is all I have.

Adoption of Apache Solr is accelerating. Being accessible through HTTP makes it possible for Solr (a Java webapp) to be used with any language. All you need is support for making HTTP calls and parsing one of the many available formats such as XML or JSON.

Drupal

Drupal is one of the most popular open source CMSs. It is written in PHP and boasts a huge user and developer base. Recently, the Drupal community has integrated Apache Solr into Drupal for vertical search. The integration is available as a Drupal module at http://drupal.org/project/apachesolr. There are some excellent tutorials available on how to get started with using this as well as a hosted solution by Acquia.

Ruby

Ruby integration has been present in Solr for a long time. There is a module called solr-ruby as well as acts_as_solr. Solr even has a Ruby response writer which outputs search results serialized as Ruby. Blacklight is an open source project I know of that uses Solr and is built in Ruby. Today, I came to know about SunSpot - A Solr powered search engine for Ruby. More details at this article in LinuxMag.

Python

Solr has a Python response writer as well as many clients. See http://wiki.apache.org/solr/SolPython for details. Reddit is one site that uses Solr with a Python front-end application. There is also HayStack for Django which can use Solr among other engines such as Xapian and Whoosh.

Solr 1.4 is nearing release with a number of features and performance improvements. On the other hand, Lucene is getting ready for near real-time search as well. Things are getting interesting in the Solr world!

There is a large amount of work being done in Lucene 2.9, a large portion of which is related to adding support for near real-time search.

To put it very simply, search engines transfer a lot of work from query time to index time. This is done to speed up queries at the cost of slower document additions. Until now, Lucene-based systems have had trouble with scenarios in which searchers need to see changes instantly (think Twitter Search). A variety of tricks and techniques exist to achieve this even now. However, near real-time search support in Lucene itself is a boon to all those people who have been building and managing such systems because the grunt work will be done by Lucene itself.

This is still under development and will probably take a few more months to mature. Solr will benefit from it as well but before that can happen, a lot of work will be needed under the hood particularly in the way Solr handles its caching.

Michael McCandless has summarized the current state of Lucene trunk in this email on the java-dev mailing list. In fact, there is so much activity that, at times, it becomes very difficult to follow all the excellent discussions that go on. There are some very talented people on that forum and there is a lot to learn for a guy like me, who started with Solr and is still trying to find his way in the Lucene code base.

Lucene 2.9 will bring huge improvements and I’m looking forward to working with other Solr developers to integrate them with Solr.


Google has announced support for building Java applications on the App Engine platform. This is great news for new App Engine developers and especially for those Java developers who had to learn Python to use App Engine.

I created a project for App Engine using Maven for builds. These were the steps I needed to follow:

1. Publish the App Engine libraries to the local Maven repository. Go to the app-engine-java-sdk directory (where the App Engine SDK is installed) and execute the following commands:

mvn install:install-file -Dfile=lib/appengine-tools-api.jar -DgroupId=com.google -DartifactId=appengine-tools -Dversion=1.2.0 -Dpackaging=jar -DgeneratePom=true

mvn install:install-file -Dfile=lib/shared/appengine-local-runtime-shared.jar -DgroupId=com.google -DartifactId=appengine-local-runtime-shared -Dversion=1.2.0 -Dpackaging=jar -DgeneratePom=true

mvn install:install-file -Dfile=lib/user/appengine-api-1.0-sdk-1.2.0.jar -DgroupId=com.google -DartifactId=appengine-sdk-1.2.0-api -Dversion=1.2.0 -Dpackaging=jar -DgeneratePom=true

mvn install:install-file -Dfile=lib/user/orm/datanucleus-appengine-1.0.0.final.jar -DgroupId=org.datanucleus -DartifactId=datanucleus-appengine -Dversion=1.0.0.final -Dpackaging=jar -DgeneratePom=true

2. Create a Maven POM file. This is the one that I used:

<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.shalin</groupId>
<artifactId>test</artifactId>
<packaging>war</packaging>
<version>1.0</version>
<name>Test</name>
<url>http://shalinsays.blogspot.com</url>

<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>

<dependency>
<groupId>com.google</groupId>
<artifactId>appengine-tools</artifactId>
<version>1.2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.google</groupId>
<artifactId>appengine-local-runtime-shared</artifactId>
<version>1.2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.google</groupId>
<artifactId>appengine-sdk-1.2.0-api</artifactId>
<version>1.2.0</version>
<scope>compile</scope>
</dependency>

<dependency>
<artifactId>standard</artifactId>
<groupId>taglibs</groupId>
<version>1.1.2</version>
<type>jar</type>
<scope>runtime</scope>
</dependency>
<dependency>
<artifactId>jstl</artifactId>
<groupId>javax.servlet</groupId>
<version>1.1.2</version>
<type>jar</type>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.geronimo.specs</groupId>
<artifactId>geronimo-el_1.0_spec</artifactId>
<version>1.0.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.geronimo.specs</groupId>
<artifactId>geronimo-jsp_2.1_spec</artifactId>
<version>1.0.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.geronimo.specs</groupId>
<artifactId>geronimo-servlet_2.5_spec</artifactId>
<version>1.2</version>
<scope>provided</scope>
</dependency>

<dependency>
<groupId>org.apache.geronimo.specs</groupId>
<artifactId>geronimo-jpa_3.0_spec</artifactId>
<version>1.1.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.geronimo.specs</groupId>
<artifactId>geronimo-jta_1.1_spec</artifactId>
<version>1.1.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.datanucleus</groupId>
<artifactId>datanucleus-appengine</artifactId>
<version>1.0.0.final</version>
<scope>compile</scope>
</dependency>

</dependencies>
<build>
<finalName>test</finalName>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.5</source>
<target>1.5</target>
</configuration>
</plugin>
</plugins>
</build>
</project>

3. Create the standard Maven directory structure and add the pom.xml in the same directory as the src directory.

You’re done!

I tested this with a simple servlet-based application and it worked fine. I did not test the JPA/JDO integration so it might be a little rough around the edges. But it should work for the most part. Note that App Engine supports Java 6. If you want to use Java 6, you can change the “source” and “target” in the build section to 1.6 instead of 1.5.


Apache Mahout 0.1 has been released. Apache Mahout is a project which attempts to make machine learning both scalable and accessible. It is a sub-project of the excellent Apache Lucene project which provides open source search software.

This is also the first public release of the Taste collaborative filtering project since it was donated to Apache Mahout last year.

From the official announcement email:

The Apache Lucene project is pleased to announce the release of Apache Mahout 0.1. Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license. The first public release includes implementations for clustering, classification, collaborative filtering and evolutionary programming.

Highlights include:
  1. Taste Collaborative Filtering
  2. Several distributed clustering implementations: k-Means, Fuzzy k-Means, Dirichlet, Mean-Shift and Canopy
  3. Distributed Naive Bayes and Complementary Naive Bayes classification implementations
  4. Distributed fitness function implementation for the Watchmaker evolutionary programming library
  5. Most implementations are built on top of Apache Hadoop (http://hadoop.apache.org) for scalability
Look at the announcement for more details - http://www.nabble.com/-ANNOUNCE--Apache-Mahout-0.1-Released-td22937220.html

There is a lot of interest in Mahout from the community and it had a successful year with the Google Summer of Code 2008 program. This year again, there have been multiple proposals and I’m sure that great things are on the way.

The Apache Mahout Wiki has a lot of good documentation on the project as well as on machine learning in general. Their mailing list is very active and of course, they have some great people involved; see the committers page. I would encourage every student interested in machine learning to participate in the project.

I wish good luck to the project and the people involved in it. Keep up the great work!

Multi-select faceting is a new feature in the soon-to-be-released Solr 1.4. It introduces support for tagging and excluding filters, which enables us to request facets on a super-set of results from Solr.

The Problem

Out-of-the-box support for faceted search is a very compelling enhancement that Solr provides on top of Lucene. I highly recommend reading through the excellent article by Yonik on faceted search at Lucid Imagination’s website, if you are not familiar with it.

Faceting on a field provides a list of (term,document-count) pairs for a given field. However, the returned facet results are always calculated on the current resultset. Therefore, whatever the current results are, the facets are always in sync with the results. This is both an advantage as well as a disadvantage.

Let us take the search UI for finding used vehicles on the Vast.com website. There are facets on the seller’s location and the vehicle’s model. Let us assume that the Solr query to show that page looks like the following:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1


What happens when you select a model by clicking on, say, “Impala”? The facet for vehicle model disappears. Why? The reason is that now only “Impala” is being shown and there are no other models present in the current result set. The Solr query looks like the following now:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq=model:Impala

So what is wrong with this? Nothing really. Except that for ease of navigation, you may still want to show all other models and document-counts which were being shown in the super-set of the current results (the previous page). But, as we noted a while back, the facets are shown for the current result set, in which all the models are Impala. If we attempt to facet on the model field with the filter query applied, we will get a list of all models. But, except for “Impala”, all other models will have a zero document count.

Solution #1 - Make another Solr query

Make another call to Solr without the filter query to get the other values. Our example query would look like:
q=chevrolet&facet=true&facet.field=model&facet.mincount=1&rows=0
The rows=0 is specified because we don’t really want the actual results, just the facets for the model field. This is a solution that can be used with any version of Solr. However, it is one additional HTTP request. Even though it is a bit inconvenient, this is usually fast enough. However, an additional call is expensive if you are using Solr’s Distributed Search which will send one or more queries to each shard.

Solution #2 - Tag and exclude filters

This is where multi-select faceting support comes in handy. With Solr 1.4, it is possible to tag the filter queries with a name. Then we can exclude one or more tagged queries when requesting for facets. All of this happens through additional metadata that is added to request parameters through a syntax called Local Params.

Let us go step by step, changing the query in the above example to see what the request to Solr will look like.

1. The original request in the above example without tagging:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq=model:Impala
2. The filter query tagged with ‘impala’:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq={!tag=impala}model:Impala
3. The facet field with the ‘impala’ filter query excluded:
q=chevrolet&facet=true&facet.field=location&facet.field={!ex=impala}model&facet.mincount=1&fq={!tag=impala}model:Impala
Now, with this one query, you can get the facets for current results as well as for the super-set without the need to make another call to Solr. If you want Solr to return this particular facet field under an alternate name, you can add a ‘key=alternative-name’ local param. For example, the following Solr query will return the ‘model’ facet under the name of ‘allModels’:
q=chevrolet&facet=true&facet.field=location&facet.field={!ex=impala key=allModels}model&facet.mincount=1&fq={!tag=impala}model:Impala
Tagging, excluding and renaming are not limited to facet fields. They can be used with facet queries, facet prefixes and date faceting too.
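
A client application usually assembles these parameters in code. The sketch below builds the tagged filter and the excluding facet field for one selection, in both the readable form used in this post and a URL-encoded form suitable for sending over HTTP. The class and method names are hypothetical and only the JDK is used; a real client library may offer its own way of doing this.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

final class MultiSelectFacetQuery {

    // Name/value pairs for one request: the filter query is tagged, and the
    // facet on the same field excludes that tag and is returned under `key`.
    static List<String[]> params(String q, String field, String value,
                                 String tag, String key) {
        List<String[]> p = new ArrayList<String[]>();
        p.add(new String[] {"q", q});
        p.add(new String[] {"facet", "true"});
        p.add(new String[] {"facet.field", "{!ex=" + tag + " key=" + key + "}" + field});
        p.add(new String[] {"facet.mincount", "1"});
        p.add(new String[] {"fq", "{!tag=" + tag + "}" + field + ":" + value});
        return p;
    }

    // Readable form, as written in the post.
    static String raw(List<String[]> params) {
        return join(params, false);
    }

    // Form with URL-encoded values, as it should go on the wire.
    static String encoded(List<String[]> params) {
        return join(params, true);
    }

    private static String join(List<String[]> params, boolean encode) {
        StringBuilder sb = new StringBuilder();
        for (String[] nv : params) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(nv[0]).append('=').append(encode ? enc(nv[1]) : nv[1]);
        }
        return sb.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

The encoded form matters in practice because the braces, the space and the equals sign inside the local params are not safe to send unescaped in a URL.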

This is another cool contribution by Yonik (also see my previous post). I’m really looking forward to the Solr 1.4 release. It is bringing a bunch of very useful features including the super-easy-to-setup Java based replication. But more on that in a later post.

Yonik Seeley recently implemented a new method for faceting which will be available in Solr 1.4 (yet to be released). It is optimized for faceting on multi-valued fields with a large number of unique terms but a relatively low number of terms per document. The new method has made a large improvement in performance for faceted search and has cut memory usage at the same time.

Background

When you facet on a field, Solr gets the list of terms in a field across all documents, executes a Filter Query for each term, caches the set of documents matching the filter, intersects it against the current result set and gives the count of documents matched for each term after the intersection. This works great for fields which have few unique values. However, it requires a large amount of memory and time when the field has a large number of unique values.
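
The intersection approach can be sketched in a few lines. This is only an illustration of the idea, using java.util.BitSet for document sets; it is not Solr’s code, and the class name is hypothetical.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

final class EnumFaceting {

    // termDocs maps each term to the set of documents containing it (the
    // cached filter). The count for a term is the size of the intersection
    // of that set with the current result set.
    static Map<String, Integer> count(Map<String, BitSet> termDocs, BitSet results) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, BitSet> e : termDocs.entrySet()) {
            BitSet hits = (BitSet) e.getValue().clone(); // keep the cached set intact
            hits.and(results);
            counts.put(e.getKey(), hits.cardinality());
        }
        return counts;
    }
}
```

The sketch makes the cost visible: one cached document set and one intersection per unique term, which is why it struggles when a field has very many unique values.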

UnInvertedField

The new method uses an UnInvertedField data structure. In very basic terms, for each document, it maintains a list of term numbers that are contained in that document. There is some pre-computation involved in building up this data structure, which is done lazily for each field, when needed. If a term is contained in too many documents, it is not un-inverted. In this new method, when you facet on a field, Solr iterates through all the documents, summing up the number of occurrences of each term. The terms which were skipped while building the data structure use the older set intersection method during faceting.
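
As a rough illustration of the un-inverted approach, the sketch below turns a term-to-documents mapping into a document-to-term-numbers mapping and then counts terms by walking the documents. It assumes that the documents walked are those in the current result set, and it leaves out the special handling of very frequent terms. The names are hypothetical and this is not the UnInvertedField implementation itself.

```java
final class UninvertedCounting {

    // termDocs[t] lists the documents containing term number t.
    // The result lists, for each document, the term numbers it contains.
    static int[][] uninvert(int[][] termDocs, int numDocs) {
        int[] sizes = new int[numDocs];
        for (int[] docs : termDocs) {
            for (int d : docs) {
                sizes[d]++;
            }
        }
        int[][] docTerms = new int[numDocs][];
        for (int d = 0; d < numDocs; d++) {
            docTerms[d] = new int[sizes[d]];
        }
        int[] fill = new int[numDocs];
        for (int t = 0; t < termDocs.length; t++) {
            for (int d : termDocs[t]) {
                docTerms[d][fill[d]++] = t; // term numbers end up in ascending order
            }
        }
        return docTerms;
    }

    // Faceting: walk the matching documents and bump a counter per term number.
    static int[] count(int[][] docTerms, int[] resultDocs, int numTerms) {
        int[] counts = new int[numTerms];
        for (int doc : resultDocs) {
            for (int t : docTerms[doc]) {
                counts[t]++;
            }
        }
        return counts;
    }
}
```

The work done at query time is proportional to the number of matching documents times the terms per document, not to the number of unique terms in the field.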

This data structure is very well optimized. It doesn’t really store the actual terms (string). Each term number is encoded as a variable-length delta from the previous term number. A TermIndex is used to convert term numbers into the actual value for only those terms which are needed after faceting is completed (the current page of facet results). The concept is simple but if not implemented in an efficient way, it may impair performance rather than improve it. Therefore, there are a lot of important optimizations in the code.
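
To show why delta encoding saves space, here is a generic variable-length delta codec for a sorted list of term numbers. It uses a common scheme of seven payload bits per byte with a continuation bit; Solr’s actual byte layout may well differ, so treat this as an illustration of the principle only. The class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;

final class DeltaVInt {

    // Encodes ascending term numbers as variable-length deltas:
    // 7 payload bits per byte, high bit set when more bytes follow.
    static byte[] encode(int[] sortedTermNumbers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int t : sortedTermNumbers) {
            int delta = t - prev;
            prev = t;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    // Reverses encode(); `count` is the number of term numbers stored.
    static int[] decode(byte[] bytes, int count) {
        int[] terms = new int[count];
        int pos = 0;
        int prev = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            terms[i] = prev;
        }
        return terms;
    }
}
```

For example, the term numbers 3, 5, 300 and 301 become the deltas 3, 2, 295 and 1, which take five bytes in this scheme instead of the sixteen needed for four plain ints.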

Performance

Yonik benchmarked the performance of the new method against the old and his tests show a large improvement in faceting performance, sometimes by more than an order of magnitude (up to 50x). The improvement becomes more significant as the number of unique tokens increases.

For a comprehensive performance study, see the comment on the Jira issue about performance here and the document here.

There are a few ideas in the code comments which give directions on possible future optimizations. But the improvement over the old method is already quite massive, so the law of diminishing returns will probably hold true here.

The structure is thrown away and re-created lazily on a commit. There might be a few concerns around the garbage accumulated by the (re)-creation of the many arrays needed for this structure. However, the performance gain is significant enough to warrant the trade-off.

Conclusion

The new method has been made the default method for faceting on non-boolean fields in the trunk code. It will be released with Solr 1.4 but it is already available in the trunk and nightly builds. If you are comfortable using the nightly builds, you are welcome to try it out.

A new request parameter has been introduced to switch to the old method if needed. Use facet.method=fc for the new method (default) and facet.method=enum for the old one.

Note - “Inside Solr” is a new feature that I hope to write regularly. It is intended to give updates about new features or improvements in Solr and at the same time, to describe the implementation details in a simple way. I invite you to give feedback through comments and tell me about what you would want to read about Solr.

Sharing a few interesting articles I read in the past few weeks on the interweb about Twitter, LinkedIn, eBay and Google.

Improving running components at Twitter describes the evolution of Twitter’s technology and about their new message queue server, named Kestrel, written in approximately 1.5K lines of Scala.

LinkedIn Communication Architecture details the heavy usage of Java, Tomcat, Jetty, Lucene, Spring and ActiveJMX at LinkedIn. Oracle and MySQL are used for data storage. They have made heavy customizations to Lucene for their near real-time indexing needs. They have open-sourced their Lucene modifications in the form of Zoie on Google Code. The upcoming Lucene In Action 2 has a case-study on how Zoie builds upon Lucene.

The eBay way is a presentation on eBay’s realtime personalization system. This mammoth system handles 4 billion reads/writes per day. The interesting thing about this system is that it uses the MySQL memory engine as a caching tier in front of a persistent tier. Some critical data is replicated (presumably on the cache tier as they talk about doubling memory needs). They encountered problems with the single-threaded MySQL replication, so it is managed through dual writes instead (the second write can be asynchronous). The system is capable of automatic redistribution of data if a node goes down.

Jeff Dean’s WSDM keynote slides on the evolution of Google’s search infrastructure are perhaps the most interesting of all. It has gone through a number of iterations over the years. I was surprised to learn that their complete index is served out of memory. It makes sense, though: as they increased the number of nodes, they crossed a point where they had enough combined memory to hold the index completely.