From the mailing list announcement:
Apache Solr 1.4.1 has been released and is now available for public
download!
http://www.apache.org/dyn/closer.cgi/lucene/solr/
Solr 1.4.1 is a bug fix release. See the change log for more details.
A couple of weeks back, Apache Lucene committer and PMC member Michael McCandless started a discussion on factoring out a shared, standalone Analysis package for Lucene, Solr and Nutch. During the discussions, Yonik Seeley, the creator of Solr, proposed merging the development of Lucene and Solr. After intense discussions and multiple rounds of voting, the following changes are being put into effect:
The following things do not change:
So what does it mean for Lucene/Solr users? Nothing much, really, except that you should see tighter coordination between Lucene and Solr development. New Lucene features should reach Solr faster and releases should be more frequent. Solr features may also be made available to Lucene users who do not want to set up Solr and use its REST-like APIs.
Already, Solr has been upgraded to use Lucene trunk (in branches/solr) and should soon become the new Solr trunk. There is talk of re-organizing the source structure to better fit the new model. Things are moving fast!
Personally, I feel that this merge is a good thing for both Lucene and Solr:
There are a couple of things to be worked out. For example, we need to decide where the integrated sources should live and whether or not to sync Solr’s version with Lucene’s. All this will take some time but I am confident that our combined community will manage the transition well.
According to a blog post from Bjørn Olstad, Microsoft Distinguished Engineer and CTO of FAST, the 2010 products will be the last to have a search core that runs on Linux and UNIX.
Being involved in Apache Solr and the newly formed Lucene Connectors Framework (LCF) project, I’m very interested in the implications. Undoubtedly, at least some FAST customers will not be happy with this decision and will want to switch to something which can still run on Linux/UNIX.
I believe that this is a great opportunity for the Apache Solr/LCF combo. Perhaps the newly proposed Apache Spatial Information Systems (SIS) will help as well. Of course, this is also big news for Lucid Imagination, Sematext and other companies that offer consultancy, training and support for Lucene/Solr.
I’d like to ask people who have used FAST in the past: what would it take for Lucene/Solr/LCF to fill the gap?
Well, the cat is out of the bag. I’ve been working with Otis on Solr In Action. We’re looking for a couple of contributors to write case studies for the book describing how they have used Solr. Otis just posted this to his blog and to the Solr mailing list as well.
So, if you are using Apache Solr in some clever, interesting or unusual way, or deal with large indexes, a large number of cores or distributed search, and are willing to share this information with the world, please get in touch. We are looking for between five and ten pages (soft limits) per case study.
You can contact either me or Otis by leaving a comment on this post with your contact info, by contacting @shalinmangar or @otisg on Twitter, or by emailing me at shalin at apache. We’ll get back to you right away.
It’d be great if you could share this message around too.
Erik has written about Solr’s usage in libraries on the Lucid Imagination Blog. Solr has found its way into many libraries and quite rightly so. However, one of the main things that Erik talks about in that blog post is the performance of DataImportHandler vs SolrMarc (the indexing library used by both VUFind and Blacklight).
Quoting from Erik’s email to the solrmarc-tech google group:
The difference in speeds:
SolrMarc: 22 docs / s
DIHmarc: 1,745 docs / s
W00t!
Well, I don’t know much about SolrMarc but I’ve seen DataImportHandler instances with comparable (or even better) throughput many times. There is something fishy inside SolrMarc for sure and I have a feeling that fixing it would be low-hanging fruit. However, Erik’s opinion is that DataImportHandler is a better way to index and that he will devote a portion of his time to helping the Solr-using library community (thanks Erik!).
DIH has really taken off since it debuted in Solr 1.3 and it would be safe to call it the de facto standard for indexing data into Solr. It may not be the most elegant way to index data but it is quick and it works great. With the planned features for DataImportHandler in Solr v1.5, it will continue to improve and it makes sense to base VUFind and Blacklight’s indexing infrastructure on top of it. I’m very excited to see this happening.

From the official announcement:
Apache Solr 1.4 has been released and is now available for public download!
http://www.apache.org/dyn/
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr’s powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
New Solr 1.4 features include
DataImportHandler is an Apache Solr module that provides a configuration-driven way to import data from databases, XML and other sources into Solr in both “full builds” and incremental delta imports.
A large number of new features have been added since it was introduced in Solr 1.3.0. Here’s a quick look at the major ones:
Error Handling & Rollback
The ability to control behavior on errors was an oft-requested feature in DataImportHandler. With Solr 1.4, DataImportHandler provides configurable error handling options for each entity. You can specify the following as an attribute on the “entity”
All errors are still logged regardless of the selected option. When an import aborts, either due to an error or a user command, all changes to the index since the last commit are rolled back.
Event Listeners
An API is exposed to write listeners for import start and end. A new interface called EventListener has been introduced which has a single method:
public void onEvent(Context ctx);
For example, the listener can be specified as:
<document onImportStart="com.foo.StartListener" onImportEnd="com.foo.EndListener">
Push data to Solr through DataImportHandler
In Solr 1.3, DataImportHandler was pull-based only. If you wanted to push data to Solr, e.g. through an HTTP POST request, you had no choice but to convert it to Solr’s update XML format or CSV format. That meant that all the DataImportHandler goodness was not available. With Solr 1.4, a new DataSource named ContentStreamDataSource allows one to push data to Solr through a regular POST request.
Suppose one wants to push the following XML to Solr and use DataImportHandler to parse and index:
<root>
  <b>
    <id>1</id>
    <c>Hello C1</c>
  </b>
  <b>
    <id>2</id>
    <c>Hello C2</c>
  </b>
</root>
We can use ContentStreamDataSource to read the XML pushed to Solr through HTTP POST:
<dataConfig>
  <dataSource type="ContentStreamDataSource" name="c"/>
  <document>
    <entity name="b" dataSource="c" processor="XPathEntityProcessor"
            forEach="/root/b">
      <field column="desc" xpath="/root/b/c"/>
      <field column="id" xpath="/root/b/id"/>
    </entity>
  </document>
</dataConfig>
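To make the mapping concrete, here is a rough Python sketch of the rows that a configuration like this is meant to produce from the XML pushed to Solr. It is only an illustration of the forEach and xpath semantics, not DataImportHandler's actual code; the function name extract_rows is hypothetical.

```python
import xml.etree.ElementTree as ET

# The same XML document that is pushed to Solr in the example.
PUSHED_XML = """<root>
  <b><id>1</id><c>Hello C1</c></b>
  <b><id>2</id><c>Hello C2</c></b>
</root>"""


def extract_rows(xml_text):
    """Mimic forEach="/root/b": emit one row per <b> element under <root>."""
    root = ET.fromstring(xml_text)  # the parsed <root> element
    rows = []
    for record in root.findall("b"):  # every /root/b becomes one row
        rows.append({
            "id": record.findtext("id"),   # corresponds to xpath="/root/b/id"
            "desc": record.findtext("c"),  # corresponds to xpath="/root/b/c"
        })
    return rows


rows = extract_rows(PUSHED_XML)
for row in rows:
    print(row)
```

Each row corresponds to one Solr document, with the column names taken from the field elements in the configuration.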
More Power to Transformers
New flag variables have been added which can be emitted by custom Transformers to skip rows, delete documents or stop further transforms.
New DataSources
Adoption of Apache Solr is accelerating. Being accessible through HTTP makes it possible for Solr (a Java webapp) to be used with any language. All you need is support for making HTTP calls and parsing one of the many available formats such as XML or JSON.
Drupal
Drupal is one of the most popular open source CMSs. It is written in PHP and boasts a huge user and developer base. Recently, the Drupal community has integrated Apache Solr into Drupal for vertical search. The integration is available as a Drupal module at http://drupal.org/project/apachesolr. There are some excellent tutorials available on how to get started with it, as well as a hosted solution by Acquia.
Ruby
Ruby integration has been present in Solr for a long time. There is a module called solr-ruby as well as acts_as_solr. Solr even has a Ruby response writer which outputs search results serialized as Ruby. Blacklight is one open source project I know of that uses Solr and is built in Ruby. Today, I came to know about SunSpot - A Solr powered search engine for Ruby. More details at this article in LinuxMag.
Python
Solr has a Python response writer as well as many clients. See http://wiki.apache.org/solr/SolPython for details. Reddit is one site that uses Solr with a Python front-end application. There is also HayStack for Django, which can use Solr among other engines such as Xapian and Whoosh.
Solr 1.4 is nearing release with a number of features and performance improvements. On the other hand, Lucene is getting ready for near real-time search as well. Things are getting interesting in the Solr world!
Multi-select faceting is a new feature in the soon-to-be-released Solr 1.4. It introduces support for tagging and excluding filters, which enables us to request facets on a superset of the results from Solr.
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1

q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq=model:Impala

q=chevrolet&facet=true&facet.field=model&facet.mincount=1&rows=0
The rows=0 is specified because we don’t really want the actual results, just the facets for the model field. This is a solution that can be used with any version of Solr. However, it requires one additional HTTP request. Even though it is a bit inconvenient, this is usually fast enough. The additional call becomes expensive, though, if you are using Solr’s Distributed Search, which will send one or more queries to each shard.
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq=model:Impala
2. The filter query tagged with ‘impala’:
q=chevrolet&facet=true&facet.field=location&facet.field=model&facet.mincount=1&fq={!tag=impala}model:Impala
3. The facet field with the ‘impala’ filter query excluded:
q=chevrolet&facet=true&facet.field=location&facet.field={!ex=impala}model&facet.mincount=1&fq={!tag=impala}model:Impala
Now, with this one query, you can get the facets for the current results as well as for the superset without the need to make another call to Solr. If you want Solr to return this particular facet field under an alternate name, you can add a ‘key=alternative-name’ local param. For example, the following Solr query will return the ‘model’ facet under the name of ‘allModels’:
q=chevrolet&facet=true&facet.field=location&facet.field={!ex=impala key=allModels}model&facet.mincount=1&fq={!tag=impala}model:Impala
Tagging, excluding and renaming are not limited to facet fields. They can be used with facet queries, facet prefixes and date faceting too.
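If you build these requests from a client application, the local params need to be URL-encoded. The following is a minimal Python sketch of how the tagged filter and the excluded facet field could be assembled. The helper name multi_select_query and its arguments are hypothetical, chosen only for this illustration; the parameter names themselves are the ones used in the queries above.

```python
from urllib.parse import urlencode, parse_qsl


def multi_select_query(q, facet_fields, filter_field, filter_value, tag, key=None):
    """Build a query string that tags one filter and excludes it when faceting
    on the same field (hypothetical helper, not part of any Solr client)."""
    params = [("q", q), ("facet", "true")]
    for field in facet_fields:
        if field == filter_field:
            # Exclude the tagged filter for this facet, optionally renaming it.
            local = "ex=" + tag
            if key:
                local += " key=" + key
            params.append(("facet.field", "{!" + local + "}" + field))
        else:
            params.append(("facet.field", field))
    params.append(("facet.mincount", "1"))
    # Tag the filter query so that the facet above can refer to it.
    params.append(("fq", "{!tag=" + tag + "}" + filter_field + ":" + filter_value))
    return urlencode(params)


query_string = multi_select_query(
    "chevrolet", ["location", "model"], "model", "Impala",
    tag="impala", key="allModels",
)
decoded = parse_qsl(query_string)  # decode again to inspect the parameters
print(query_string)
```

The encoded string can be appended to the select URL of your Solr instance; decoding it again gives back the same parameters as in the last query above.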
Yonik Seeley recently implemented a new method for faceting which will be available in Solr 1.4 (yet to be released). It is optimized for faceting on multi-valued fields with a large number of unique terms but a relatively low number of terms per document. The new method has made a large improvement in performance for faceted search and has cut memory usage at the same time.
Background
When you facet on a field, Solr gets the list of terms in a field across all documents, executes a Filter Query for each term, caches the set of documents matching the filter, intersects it against the current result set and gives the count of documents matched for each term after the intersection. This works great for fields which have few unique values. However, it requires a large amount of memory and time when the field has a large number of unique values.
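As a rough illustration of this set-intersection approach, here is a small Python sketch. It is a toy model and not Solr's code: the dictionary of per-term document sets stands in for the cached filters, and the names facet_by_intersection, term_to_docs and result_docs are made up for the example.

```python
def facet_by_intersection(term_to_docs, result_docs):
    """Count, for every term, how many documents of the result set contain it.

    term_to_docs plays the role of the cached filter for each term;
    one set intersection is needed per unique term in the field.
    """
    counts = {}
    for term, docs in term_to_docs.items():
        count = len(docs & result_docs)  # intersect with the current results
        if count:
            counts[term] = count
    return counts


# Toy index: term -> ids of the documents containing that term.
term_to_docs = {
    "Impala": {1, 2, 5},
    "Malibu": {3, 4},
    "Camaro": {6},
}
result_docs = {1, 3, 4, 5}  # documents matching the current query

counts = facet_by_intersection(term_to_docs, result_docs)
print(counts)
```

The loop runs once per unique term, which is why the cost grows with the number of unique values in the field.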
UnInvertedField
The new method uses an UnInvertedField data structure. In very basic terms, for each document, it maintains a list of term numbers that are contained in that document. There is some pre-computation involved in building up this data structure, which is done lazily for each field, when needed. If a term is contained in too many documents, it is not un-inverted. In this new method, when you facet on a field, Solr iterates through all the documents, summing up the number of occurrences of each term. The terms which were skipped while building the data structure use the older set intersection method during faceting.
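The idea can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not the actual UnInvertedField code: the structure is built eagerly rather than lazily, the threshold max_doc_freq is a hypothetical parameter, and the counting loop runs over the documents in the current result set.

```python
from collections import Counter


def build_uninverted(term_to_docs, max_doc_freq):
    """Turn a term -> docs index into a doc -> term numbers structure.

    Terms contained in more than max_doc_freq documents are not un-inverted;
    they are kept aside and counted by set intersection later.
    """
    terms = sorted(term_to_docs)  # the term number is the position in this list
    doc_to_terms = {}
    big_terms = {}
    for term_number, term in enumerate(terms):
        docs = term_to_docs[term]
        if len(docs) > max_doc_freq:
            big_terms[term] = docs
            continue
        for doc in docs:
            doc_to_terms.setdefault(doc, []).append(term_number)
    return terms, doc_to_terms, big_terms


def facet_counts(terms, doc_to_terms, big_terms, result_docs):
    """Sum term occurrences per document, then handle the skipped terms."""
    counts = Counter()
    for doc in result_docs:
        for term_number in doc_to_terms.get(doc, ()):
            counts[terms[term_number]] += 1
    for term, docs in big_terms.items():  # older set intersection method
        matched = len(docs & result_docs)
        if matched:
            counts[term] = matched
    return dict(counts)


term_to_docs = {"red": {1, 2, 3, 4}, "suv": {1, 3}, "sedan": {2}}
terms, doc_to_terms, big_terms = build_uninverted(term_to_docs, max_doc_freq=3)
counts = facet_counts(terms, doc_to_terms, big_terms, result_docs={1, 2})
print(counts)
```

In this toy example "red" is in too many documents and is counted by intersection, while "suv" and "sedan" are counted by walking the term lists of documents 1 and 2.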
This data structure is very well optimized. It doesn’t really store the actual terms (strings). Each term number is encoded as a variable-length delta from the previous term number. A TermIndex is used to convert term numbers into the actual values for only those terms which are needed after faceting is completed (the current page of facet results). The concept is simple, but if it is not implemented efficiently, it may impair performance rather than improve it. Therefore, there are a lot of important optimizations in the code.
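To show why delta encoding saves space, here is a small Python sketch of a variable-length delta encoding for a sorted list of term numbers. The byte layout (7 bits per byte, low-order bits first, high bit as a continuation flag) is an assumption made for this example and may differ from what Solr actually writes.

```python
def encode_term_numbers(term_numbers):
    """Encode sorted term numbers as variable-length deltas (assumed layout)."""
    out = bytearray()
    previous = 0
    for number in sorted(term_numbers):
        delta = number - previous  # small when term numbers are close together
        previous = number
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # 7 payload bits + "more" flag
            delta >>= 7
        out.append(delta)  # last byte of this delta, high bit clear
    return bytes(out)


def decode_term_numbers(data):
    """Reverse encode_term_numbers and return the original term numbers."""
    numbers = []
    previous = 0
    delta = 0
    shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:  # more bytes follow for this delta
            shift += 7
        else:
            previous += delta
            numbers.append(previous)
            delta = 0
            shift = 0
    return numbers


term_numbers = [3, 300, 301, 70000]
encoded = encode_term_numbers(term_numbers)
decoded = decode_term_numbers(encoded)
print(len(encoded), decoded)
```

Four term numbers stored as plain 32-bit integers would take 16 bytes; with this encoding the same list fits in fewer bytes because most deltas are small.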
Performance
Yonik benchmarked the performance of the new method against the old and his tests show a large improvement in faceting performance, sometimes by an order of magnitude or more (up to 50x). The improvement becomes more significant as the number of unique terms increases.
For a comprehensive performance study, see the comment on the Jira issue about performance here and the document here.
There are a few ideas in the code comments which give directions on possible future optimizations. But the improvements over the old method are already quite massive, so the law of diminishing returns will probably hold true here.
The structure is thrown away and re-created lazily on a commit. There might be a few concerns around the garbage accumulated by the (re)-creation of the many arrays needed for this structure. However, the performance gain is significant enough to warrant the trade-off.
Conclusion
The new method has been made the default method for faceting on non-boolean fields in the trunk code. It will be released with Solr 1.4 but it is already available in the trunk and nightly builds. If you are comfortable using the nightly builds, you are welcome to try it out.
A new request parameter has been introduced to switch to the old method if needed. Use facet.method=fc for the new method (default) and facet.method=enum for the old one.
Note - “Inside Solr” is a new feature that I hope to write regularly. It is intended to give updates about new features or improvements in Solr and at the same time, to describe the implementation details in a simple way. I invite you to give feedback through comments and tell me about what you would want to read about Solr.