Shalin Says...

From the mailing list announcement:

Apache Solr 1.4.1 has been released and is now available for public
download!
http://www.apache.org/dyn/closer.cgi/lucene/solr/

Solr 1.4.1 is a bug fix release. See the change log for more details.

A couple of weeks back, Apache Lucene Committer and PMC member, Michael McCandless started a discussion on factoring out a shared, standalone Analysis package for Lucene, Solr and Nutch. During the discussions, Yonik Seeley, Solr Creator, proposed merging the development of Lucene and Solr. After intense discussions and multiple rounds of voting, the following changes are being put into effect:

  • Merging the developer mailing lists into a single list.
  • Merging the set of committers.
  • When any change is committed (to a module that “belongs to” Solr or to Lucene), all tests must pass.
  • Release details will be decided by dev community, but, Lucene may release without Solr.
  • Modularize the sources: pull things out of Lucene’s core (break out query parser, move all core queries & analyzers under their contrib counterparts), pull things out of Solr’s core (analyzers, queries).

The following things do not change:

  • Besides modularizing (above), the source code would remain factored into separate dirs/modules the way it is now.
  • Issue tracking remains separate (SOLR-XXX and LUCENE-XXX issues).
  • User’s lists remain separate.
  • Web sites remain separate.
  • Release artifacts/jars remain separate.

So what does it mean for Lucene/Solr users? Nothing much, really. Except that you should see tighter co-ordination between Lucene and Solr development. New Lucene features should reach Solr faster and releases should be more frequent. Solr features may also be made available to Lucene users who do not want to setup Solr use the RESTy APIs.

Already, Solr has been upgraded to use Lucene trunk (in branches/solr) and should soon become the new Solr trunk. There is talk of re-organizing the source structure to better fit the new model. Things are moving fast!

Personally, I feel that this merge is a good thing for both Lucene and Solr:

  • Solr users get the latest Lucene improvements faster and releases get streamlined.
  • Lucene users get access to Solr features such as faceting.
  • The in-sync trunk allows new features to make their way into the right place (Lucene vs Solr) more easily and duplication is minimized.
  • Bugs are caught earlier by the huge combined test suite.
  • More number of committers means more ideas and hands available to the projects
  • Other Lucene based projects can benefit too because many Solr features will be made available through Java APIs.

There are a couple of things to be worked out. For example, we need to decide where the integrated sources should live and whether or not to sync Solr’s version with Lucene’s. All this will take some time but I am confident that our combined community will manage the transition well.

According to a blog post from Microsoft Distinguished Engineer and CTO, FAST Bjørn Olstad, the 2010 products will be the last to have a search core that runs on Linux and UNIX.

Being involved in Apache Solr and the newly formed Lucene Connectors Framework (LCF) project, I’m very interested in the implications. Undoubtedly, at least some FAST customers will not be happy with this decision and will want to switch to something which can still run on Linux/UNIX.

I believe that this is a great opportunity for the Apache Solr/LCF combo. Perhaps, the newly proposed Apache Spatial Information Systems (SIS) will help as well. Of course, this is big news for Lucid Imagination, Sematext and other companies as well who offer consultancy, training and support for Lucene/Solr.

I’d like to ask people who have used FAST in the past, what would it take for Lucene/Solr/LCF to fill the gap?

Well, the cat is out of the bag. I’ve been working with Otis on Solr In Action. We’re looking for a couple of contributors to write case studies for the book describing how they have used Solr. Otis just posted this to his blog and to the Solr mailing list as well.

So, if you are are using Apache Solr in some clever, interesting or unusual way, or deal with large indexes or large number of cores or distributed search and are willing to share this information with the world, please get in touch. We are looking for between five to ten pages (soft limits) per case study.

You can contact either me or Otis by leaving a comment on this post with your contact info or contact @shalinmangar or @otisg on Twitter or email me on shalin at apache and we’ll get back to you right away.

It’d be great if you can share this message around too.

Erik has written about Solr’s usage in libraries on the Lucid Imagination Blog. Solr has found its way into many libraries and quite rightly so. However, one of the main things that Erik talks about in that blog post is the performance of DataImportHandler vs SolrMarc (the indexing library used by both VUFind and Blacklight).

Quoting from Erik’s email to the solrmarc-tech google group:

The difference in speeds:

SolrMarc: 22 docs / s
DIHmarc: 1,745 docs / s

W00t!

Well, I don’t know much about SolrMarc but I’ve seen DataImportHandler instances with comparable (or even better) throughput many times. There is something fishy inside SolrMarc for sure and I have a feeling that fixing it would be a low hanging fruit. However, Erik’s opinion is that DataImportHandler is a better way to index and that he will devote a portion of his time to helping the Solr using library community (thanks Erik!).

DIH has really taken off since it debuted in Solr 1.3 and it would be safe to call it the de-facto standard for indexing data into Solr. It may not be the most elegant way to index data but it is quick and it works great. With the planned features for DataImportHandler in Solr v1.5, it will continue to improve and it makes sense to base VUFind and Blacklight’s indexing infrastructure on top of it. I’m very excited to see this happening.


From the official announcement:

Apache Solr 1.4 has been released and is now available for public download!
http://www.apache.org/dyn/closer.cgi/lucene/solr/

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of
many of the world’s largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr’s powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

New Solr 1.4 features include

  • Major performance enhancements in indexing, searching, and faceting
  • Revamped all-Java index replication that’s simple to configure and can replicate configuration files
  • Greatly improved database integration via the DataImportHandler
  • Rich document processing (Word, PDF, HTML) via Apache Tika
  • Dynamic search results clustering via Carrot2
  • Multi-select faceting (support for multiple items in a single category to be selected)
  • Many powerful query enhancements, including ranges over arbitrary functions, and nested queries of different syntaxes
  • Many other plugins including Terms for auto-suggest, Statistics, TermVectors, Deduplication
Performance Enhancements
  1. A simple FieldCache load test
  2. Filtered query performance increases
  3. Solr scalability improvements
  4. Solr faceted search performance improvements
  5. Improvements in Solr Faceting Search
Revamped All-Java Replication
  1. SolrReplication wiki page
  2. Works on Microsoft Windows Platforms too!
DataImportHandler improvements
  1. What’s new in DataImportHandler in Solr
  2. DataImportHandler wiki page
Rich document processing
  1. ExtractingRequestHandler Wiki page
  2. Posting Rich Documents to Apache Solr using SolrJ and Solr Cell
Dynamic Search Results Clustering
  1. ClusteringComponent Wiki page
  2. Solr’s new Clustering Capabilities
Multi-select Faceting
  1. Local params for faceting
  2. Tagging and excluding filters
Query Enhancements
  1. Ranges over functions
  2. Nested query support for any type of query parser (via QParserPlugin). Quotes will often be necessary to encapsulate the nested query if it contains reserved characters. Example: _query_:”{!dismax qf=myfield}how now brown cow”
New Plugins
  1. TermsComponent (can be used for auto-suggest)
  2. TermVectorComponent
  3. Statistics
  4. Deduplication
SolrJ - Java client
  1. Faster, more efficient Binary Update format
  2. Javabean (POJO) binding support
  3. Fast multi-threaded updates through StreamingUpdateSolrServer
  4. Simple round-robin load balancing client - LBHttpSolrServer
  5. Stream documents through an Iterator API
  6. Many performance optimizations
Miscellaneous
  1. Rollback command in UpdateHandler
  2. More configurable logging through the use of SLF4J library
  3. ‘commitWithin’ parameter on add document command allows setting a per-request auto-commit time limit.
  4. TokenFilter factories for Arabic language
  5. Improved Thai language tokenization (SOLR-1078)
  6. Merge multiple indexes
  7. Expunge Deletes command
Upgrade instructions

Although Solr 1.4 is backwards-compatible with previous releases, users are encouraged to read the upgrading notes in the Solr Change Log.

There are so many more new features, optimizations, bug fixes and refactorings that it is not possible to cover them all in a single blog post.

A large amount of effort has gone into this release. Many congratulations to the entire Solr community for making this happen!

Great things are planned for the next release and it is a great time to get involved. See http://wiki.apache.org/solr/HowToContribute for how to get started.

Enjoy Solr 1.4 and let us know on the mailing lists if you have any questions!

DataImportHandler is a Apache Solr module that provides a configuration driven way to import data from databases, XML and other sources into Solr in both “full builds” and incremental delta imports.

A large number of new features have been introduced since it was introduced in Solr 1.3.0. Here’s a quick look at the major new features:

Error Handling & Rollback

Ability to control behavior on errors was an oft-request feature in DataImportHandler. With Solr 1.4, DataImportHandler provides configurable error handling options for each entity. You can specify the following as an attribute on the “entity” tag:

  1. onError=”abort” - Aborts the import process
  2. onError=”skip” - Skips the current document
  3. onError=”continue” - Continues as if the error never occurred
All errors are still logged regardless of the selected option. When an import aborts, either due to an error or a user command, all changes to the index since the last commit are rolled back.

Event Listeners

An API is exposed to write listeners for import start and end. A new interface called EventListener has been introduced which has a single method:

public void onEvent(Context ctx);

For example, the listener can be specified as:
<document onImportStart="com.foo.StartListener" onImportEnd="com.foo.EndListener">
Push data to Solr through DataImportHandler

In Solr 1.3, DataImportHandler was pull based only. If you wanted to push data to Solr e.g. through a HTTP POST request, you had no choice but to convert it to Solr’s update XML format or CSV format. That meant that all the DataImportHandler goodness was not available. With Solr 1.4, a new DataSource named ContentStreamDataSource allows one to push data to Solr through a regular POST request.

Suppose one wants to push the following XML to Solr and use DataImportHandler to parse and index:

<root>
<b>
<id>1</id>
<c>Hello C1</c>
</b>
<b>
<id>2</id>
<c>Hello C2</c>
</b>
</root>

We can use ContentStreamDataSource to read the XML pushed to Solr through HTTP POST:

<dataConfig>
<dataSource type="ContentStreamDataSource" name="c"/>
<document>
<entity name="b" dataSource="c" processor="XPathEntityProcessor"
forEach="/root/b">
<field column="desc" xpath="/root/b/c"/>
<field column="id" xpath="/root/b/id"/>
</entity>
</document>
</dataConfig>

More Power to Transformers

New flag variables have been added which can be emitted by custom Transformers to skip rows, delete documents or stop further transforms.

New DataSources
  • FieldReaderDataSource - Reads data from an entity’s field. This can be used, for example, to read XMLs stored in databases.
  • ContentStreamDataSource - Accept HTTP POST data in a content stream (described above)
New EntityProcessors
  • PlainTextEntityProcessor - Reads from any DataSource and outputs a String
  • MailEntityProcessor (experimental) - Indexes mails from POP/IMAP sources into a solr index. Since it required extra dependencies, it is available as a separate package called “solr-dataimporthandler-extras”.
  • LineEntityProcessor - Streams lines of text from a given file to be indexed directly or for processing with transformers and child entities.
New Transformers
  • HTMLStripTransformer - Strips HTML tags from input text using Solr’s HTMLStripCharFilter
  • ClobTransformer - Read strings from Clob types in databases.
  • LogTransformer - Log data in a given template format. Very useful for debugging.
Apart from the above new features, there have been numerous bug fixes, optimizations and refactorings. In particular:
  • Optimized defaults for database imports
  • Delta imports consume less memory
  • A ‘deltaImportQuery’ attribute has been introduced which is used for delta imports along with ‘deltaQuery’ instead of DataImportHandler manipulating the SQL itself (which was error-prone for complex queries). Using only ‘deltaQuery’ without a ‘deltaImportQuery’ is deprecated and will be removed in future releases.
  • The ‘where’ attribute has been deprecated in favor of ‘cacheKey’ and ‘cacheLookup’ attributes making CachedSqlEntityProcessor easier to understand and use.
  • Variables placed in DataSources, EntityProcessor and Transformer attributes are now resolved making very dynamic configurations possible.
  • JdbcDataSource can lookup javax.sql.DataSource using JNDI
  • A revamped EntityProcessor APIs for ease in creating custom EntityProcessors
There are many more changes, see the changelog for the complete list. There’s a new DIHQuickStart wiki page which can help you get started faster by providing cheat sheet solutions. Frequently asked questions along with their answers are recorded in the new DataImportHandlerFaq wiki page.

A big THANKS to all the contributors and users who have helped us by giving patches, suggestions and bug reports!

Future Roadmap

Once Solr 1.4 is released, there are a slew of features targeted for Solr 1.5, including:
  • Multi-threaded indexing
  • Integration with Solr Cell to import binary and/or structured documents such as Office, Word, PDF and other proprietary formats
  • DataImportHandler as an API which can be used for creating Lucene indexes (independent of Solr) and as a companion to Solrj (for true push support). It will also be possible to extend it for other document oriented, de-normalized data stores such as CouchDB.
  • Support for reading Gzipped files
  • Support for scheduling imports
  • Support for Callable statements (stored procedures)
If you have any feature requests or contributions in mind, do let us know on the solr-user mailing list.

Adoption of Apache Solr is accelerating. Being accessible though HTTP makes it possible for Solr (a Java webapp) to be used with any language. All you need is support for making HTTP calls and parsing one of the many available formats such as XML or JSON.

Drupal

Drupal is one of the most popular CMS available as open source. It is written in PHP and boasts of a huge user and developer base. Recently, the Drupal community has integrated Apache Solr into Drupal for vertical search. The integration is available as a Drupal module at http://drupal.org/project/apachesolr. There are some excellent tutorials available on how to get started with using this as well as a hosted solution by Acquia.

Ruby

Ruby integration has been present in Solr since a long time. There is a module called solr-ruby as well as acts_as_solr. Solr even has a ruby response writer which outputs search results serialized in ruby. Blacklight is an open source project I know that uses Solr and is built in Ruby. Today, I came to know about SunSpot - A Solr powered search engine for Ruby. More details at this article in LinuxMag.

Python

Solr has a python response writer as well as many clients. See http://wiki.apache.org/solr/SolPython for details. Reddit is one site that uses Solr with a python front-end application. There is also HayStack for Django which can use Solr among other engines such as Xapian and Whoosh.

Solr 1.4 is nearing release with a number of features and performance improvements. On the other hand, Lucene is getting ready for near real-time search as well. Things are getting interesting in the Solr world!

Multi-select faceting is a new feature in the, soon to be released, Solr 1.4. It introduces support for tagging and excluding filters which enables us to request facets on a super-set of results from Solr.

The Problem

Out-of-the-box support for faceted search is a very compelling enhancement that Solr provides on top of Lucene. I highly recommend reading through the excellent article by Yonik on faceted search at Lucid Imagination’s website, if you are not familiar with it.

Faceting on a field provides a list of (term,document-count) pairs for a given field. However, the returned facet results are always calculated on the current resultset. Therefore, whatever the current results are, the facets are always in sync with the results. This is both an advantage as well as a disadvantage.

Let us take the search UI for finding used vehicles on the Vast.com website. There are facets on the seller’s location and the vehicle’s model. Let us assume that the Solr query to show that page looks like the following:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1


What happens when you select a model by clicking on, say “Impala”? The facet for vehicle model disappears. Why? The reason is that now only “Impala” is being shown and there are no other models present in the current result set. The Solr query looks like the following now:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq=model:Impala

So what is wrong with this? Nothing really. Except that for ease of navigation, you may still want to show all other models and document-counts which were being shown in the super-set of the current results (the previous page). But, as we noted a while back, the facets are shown for the current result set, in which all the models are Impala. If we attempt to facet on models field with the filter query applied, we will get a list of all models. But, except for “Impala”, all other models will have a zero document count.

Solution #1 - Make another Solr query

Make another call to Solr without the filter query to get the other values. Our example query would look like:
q=chevrolet&facet=true&facet.field==model&facet.mincount=1&rows=0
The rows=0 is specified because we don’t really want the actual results, just the facets for the model field. This is a solution that can be used with any version of Solr. However, it is one additional HTTP request. Even though it is a bit inconvenient, this is usually fast enough. However, an additional call is expensive if you are using Solr’s Distributed Search which will send one or more queries to each shard.

Solution #2 - Tag and exclude filters

This is where multi-select faceting support comes in handy. With Solr 1.4, it is possible to tag the filter queries with a name. Then we can exclude one or more tagged queries when requesting for facets. All of this happens through additional metadata that is added to request parameters through a syntax called Local Params.

Let us go step-by-step and change the query in the above example and see how the request to Solr will look like.

1. The original request in the above example without tagging:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq=model:Impala
2. The filter query tagged with ‘impala’:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq={!tag=impala}model:Impala
3. The facet field with the ‘impala’ filter query excluded:
q=chevrolet&facet=true&facet.field=location&facet.field={!ex=impala}model&facet.mincount=1&fq={!tag=impala}model:Impala
Now, with this one query, you can get the facets for current results as well as for the super-set without the need to make another call to Solr. If you want Solr to return this particular facet field under an alternate name, you can add a ‘key=alternative-name’ local param. For example, the following Solr query will return the ‘models’ facet under the name of ‘allModels’:
q=chevrolet&facet=true&facet.field=location&facet.field={!ex=impala key=allModels}model&facet.mincount=1&fq={!tag=impala}model:Impala
Tagging, filtering and renaming is not just limited to facet fields. It can be used with facet queries, facet prefixes and date faceting too.

This is another cool contribution by Yonik (also see my previous post). I’m really looking forward to the Solr 1.4 release. It is bringing a bunch of very useful features including the super-easy-to-setup Java based replication. But more on that in a later post.

Yonik Seeley recently implemented a new method for faceting which will be available in Solr 1.4 (yet to be released). It is optimized for faceting on multi-valued fields with large number of unique terms but relatively low number of terms per document. The new method has made a large improvement in performance for faceted search and has cut memory usage at the same time.

Background

When you facet on a field, Solr gets the list of terms in a field across all documents, executes a Filter Query for each term, caches the set of documents matching the filter, intersects it against the current result set and gives the count of documents matched for each term after the intersection. This works great for fields which have few unique values. However, it requires a large amount of memory and time when the field has a large number of unique values.

UnInvertedField

The new method uses an UnInvertedField data structure. In very basic terms, for each document, it maintains a list of term numbers that are contained in that document. There is some pre-computation involved in building up this data structure, which is done lazily for each field, when needed. If a term is contained in too many documents, it is not un-inverted. In this new method, when you facet on a field, Solr iterates through all the documents, summing up the number of occurrences of each term. The terms which were skipped while building the data structure use the older set intersection method during faceting.

This data structure is very well optimized. It doesn’t really store the actual terms (string). Each term number is encoded as a variable-length delta from the previous term number. A TermIndex is used to convert term numbers into the actual value for only those terms which are needed after faceting is completed (the current page of facet results). The concept is simple but if not implemented in an efficient way, it may impair performance rather than improve it. Therefore, there are a lot of important optimizations in the code.

Performance

Yonik benchmarked the performance of the new method against the old and his tests show a lot of improvement in faceting performance, sometimes by an order of magnitude (upto 50x). The improvement is much more significant as the number of unique tokens are increased.

For a comprehensive performance study, see the comment on the Jira issue about performance here and the document here.

There are a few ideas in the code comments which give directions on possible future optimizations. But the improvement from the old method are already quite massive, probably the law of diminishing returns will hold true here.

The structure is thrown away and re-created lazily on a commit. There might be a few concerns around the garbage accumulated by the (re)-creation of the many arrays needed for this structure. However, the performance gain is significant enough to warrant the trade-off.

Conclusion

The new method has been made the default method for faceting on non-boolean fields in the trunk code. It will be released with Solr 1.4 but it is already available in the trunk and nightly builds. If you are comfortable using the nightly builds, you are welcome to try it out.

A new request parameter has been introduced to switch to the old method if needed. Use facet.method=fc for the new method (default) and facet.method=enum for the old one.

Note - “Inside Solr” is a new feature that I hope to write regularly. It is intended to give updates about new features or improvements in Solr and at the same time, to describe the implementation details in a simple way. I invite you to give feedback through comments and tell me about what you would want to read about Solr.