Shalin Says...
Saw a Kurkure advertisement on a website titled “No plastic in Kurkure”. ROTFL!

Well, the cat is out of the bag. I’ve been working with Otis on Solr In Action. We’re looking for a couple of contributors to write case studies for the book describing how they have used Solr. Otis just posted this to his blog and to the Solr mailing list as well.

So, if you are using Apache Solr in some clever, interesting or unusual way, or deal with large indexes, a large number of cores or distributed search, and are willing to share this information with the world, please get in touch. We are looking for between five and ten pages (soft limits) per case study.

You can contact either me or Otis by leaving a comment on this post with your contact info, by contacting @shalinmangar or @otisg on Twitter, or by emailing me at shalin at apache, and we’ll get back to you right away.

It’d be great if you can share this message around too.

Amit Deshpande and Dirk Riehle from SAP Labs have conducted and published research on the growth of open source software.

The data has been culled from Ohloh.net and is based on the stats and activity of around 5000 open source projects, spanning 30 different languages and 103 open source licenses.

Some interesting quotes from the publication:

Successful open source projects like Linux, Apache, PostgreSQL and many others are growing super-linearly. Previous research showed that linear and quadratic growth is the dominant growth pattern of open source software projects

Our work shows that the additions to open source projects, the total project size (measured in source lines of code), the number of new open source projects, and the total number of open source projects are growing at an exponential rate. The total amount of source code and the total number of projects double about every 14 months.

Open Source has taken off handsomely and continues to thrive. It is not just about philosophy any more, it is good business sense.
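To put the 14-month doubling time quoted above in annual terms, here is a small back-of-the-envelope sketch. It assumes clean exponential growth over the whole period, and the function name is purely illustrative, not something from the paper.

```python
def annual_growth_factor(doubling_months):
    """Yearly growth factor implied by a doubling time given in months.

    Assumes pure exponential growth: size(t) = size(0) * 2 ** (t / doubling_months).
    """
    return 2 ** (12 / doubling_months)


factor = annual_growth_factor(14)
# Prints: 1.81x per year, i.e. about 81% annual growth
print(f"{factor:.2f}x per year, i.e. about {100 * (factor - 1):.0f}% annual growth")
```

In other words, doubling every 14 months works out to roughly 81% growth per year.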

In case you are interested in Solr stats, see the Solr project page at Ohloh.

Update - The report is quite old but I just discovered it now :)

Erik has written about Solr’s usage in libraries on the Lucid Imagination Blog. Solr has found its way into many libraries and quite rightly so. However, one of the main things that Erik talks about in that blog post is the performance of DataImportHandler vs SolrMarc (the indexing library used by both VUFind and Blacklight).

Quoting from Erik’s email to the solrmarc-tech Google group:

The difference in speeds:

SolrMarc: 22 docs / s
DIHmarc: 1,745 docs / s

W00t!

Well, I don’t know much about SolrMarc but I’ve seen DataImportHandler instances with comparable (or even better) throughput many times. There is something fishy inside SolrMarc for sure and I have a feeling that fixing it would be low-hanging fruit. However, Erik’s opinion is that DataImportHandler is a better way to index and that he will devote a portion of his time to helping the Solr-using library community (thanks Erik!).

DIH has really taken off since it debuted in Solr 1.3 and it would be safe to call it the de facto standard for indexing data into Solr. It may not be the most elegant way to index data but it is quick and it works great. With the features planned for DataImportHandler in Solr v1.5, it will continue to improve, and it makes sense to base VUFind and Blacklight’s indexing infrastructure on top of it. I’m very excited to see this happening.

Opera published a study titled State of the Mobile Web, November 2009, which I found through TechCrunch. I can’t help but notice the tremendous growth in web usage through mobile phones in India. Page views have grown by 228.5% Y/Y and unique users have grown by 208.4%, but if you look at metrics like page views per user or the amount of data transferred per user, you’ll see that they are quite small.
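The two growth figures above also hint at how page views per user moved over the year. This is only a rough sketch: it assumes both percentages cover the same period and the same user base, and the function name is hypothetical rather than anything from Opera's report.

```python
def per_user_growth_pct(page_view_growth_pct, user_growth_pct):
    """Implied Y/Y growth in page views per user, in percent.

    A metric that "grew by X%" is multiplied by (1 + X / 100), so the
    per-user ratio grows by the quotient of the two multipliers.
    """
    ratio = (1 + page_view_growth_pct / 100) / (1 + user_growth_pct / 100)
    return (ratio - 1) * 100


# Prints: 6.5
print(round(per_user_growth_pct(228.5, 208.4), 1))
```

Under those assumptions, page views per user grew by only about 6.5%, so nearly all of the growth comes from new users rather than heavier usage.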

One of the reasons is that in India we don’t have 3G (actually we do, but it is limited to only one provider, BSNL) and browsing the Internet is painfully slow on GPRS connections. With 3G finally coming next year, I’m quite sure that mobile web usage in India will just explode.

Right now the only kind of mobile applications that can work in India are SMS based. The increase in download speeds will make more people use the Internet and SMS will become less relevant, though that may take more than a year. If 3G is indeed introduced by the middle of next year by prominent providers like Airtel and Vodafone, I won’t be surprised to see the Y/Y growth rate exceed 500%.

Companies building products/services around SMS should start thinking about a mobile web strategy now.

I had been thinking about moving away from Blogger to my own domain. Finally, I decided to give in and I was fortunate enough to buy this domain. Blogger has been a simple service but I wanted to try the new kids, Tumblr and Posterous. After spending some time fiddling with both of them, I decided to go with Tumblr.

It took me some time to figure out the right way to move from Blogger. I used the import script written by Jonathan Tron to import my old posts. Sadly, there is no way to import comments from Blogger. The least I could do was to import them into Disqus and link the same Disqus account to Tumblr, which actually does not help much.

The bigger issue was to migrate without leaving RSS subscribers in the lurch. The slightly lesser issue was to preserve my earlier blog’s Google page rank.

The first issue was solved easily. You can do the same with the following sequence of steps.

  1. Create a Feedburner account and import your Blogger feed.
  2. Set your Blogger feed to redirect to Feedburner.
  3. Edit your Tumblr theme and replace the RSS link to point to Feedburner.
  4. Edit your Feedburner feed to import from Tumblr’s RSS.

The second was slightly trickier. The right way to move a website to a different domain is to use permanent redirects from the old page to the new page. However, Blogger (obviously) does not allow you to do that. Thankfully, Google recently announced support for specifying canonical links which can point to the preferred version of a URL. So, I hacked up a script to match pages of Blogger’s RSS with Tumblr’s RSS and generate conditional Blogger template snippets which let me specify the canonical (Tumblr) URL for each page on my Blogger account.

I couldn’t redirect so I had to fall back on meta-refresh to redirect anyone visiting an old page to the new page. I hate to break the back button like this but that was the only possible way in this case. This is what it looked like:
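A rough sketch of the approach described above (not the original script) might look like the following. It assumes both feeds are RSS 2.0, that posts can be matched by identical titles, and that a `<b:if>` test on `data:blog.url` is enough to target a single page in the Blogger template. The function names are hypothetical.

```python
from html import escape
from xml.etree import ElementTree


def rss_links_by_title(rss_xml):
    """Map each <item> title in an RSS 2.0 document to its <link>."""
    root = ElementTree.fromstring(rss_xml)
    links = {}
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title and link:
            links[title] = link
    return links


def blogger_snippets(blogger_rss, tumblr_rss):
    """Build one conditional template snippet per post found in both feeds."""
    old_links = rss_links_by_title(blogger_rss)
    new_links = rss_links_by_title(tumblr_rss)
    snippets = []
    for title, old_url in old_links.items():
        new_url = new_links.get(title)
        if new_url is None:
            continue  # no Tumblr post with the same title; skip this page
        old_attr = escape(old_url)  # escape &, <, > and quotes for the template
        new_attr = escape(new_url)
        snippets.append(
            f"<b:if cond='data:blog.url == \"{old_attr}\"'>\n"
            f"  <link rel='canonical' href='{new_attr}'/>\n"
            f"  <meta http-equiv='refresh' content='0; url={new_attr}'/>\n"
            f"</b:if>"
        )
    return snippets
```

Each generated snippet would be pasted into the `<head>` of the Blogger template. Matching on titles is the weakest assumption here, since two posts with the same title would collide.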

I used to think that blogging was a solved problem. After doing all this and trying out many services, I don’t believe that anymore.

AOL’s new logos, which one do you like?

AOL listed on the New York Stock Exchange on 10th December 2009. This has been in the works for a long time and I’m glad we’re finally here. Things are changing around the company and I’m happy to be a part of this change.

AOL has a new logo (and yes, it is still to be written as AOL). I loved the new brand videos; watch them on YouTube - http://www.youtube.com/watch?v=YlSL7svbooY

Seed.com was also launched a few days back. It is a new spin on crowdsourcing content which, I believe, is a great idea.

We’re just getting started!


Apache Lucene Java 3.0.0 has been released. Lucene Java 3.0.0 is mostly a clean-up release without any new features. It paves the path for refactoring and adding new features without the shackles of backwards compatibility. All APIs deprecated in Lucene 2.9 have been removed and Lucene Java has officially moved to Java 5 as the minimum requirement.

See the announcement email for more details. Congratulations Lucene Devs!



Apache Mahout 0.2 has been released. Apache Mahout is a project which attempts to make machine learning both scalable and accessible. It is a sub-project of the excellent Apache Lucene project which provides open source search software.

From the project website:

The Apache Lucene project is pleased to announce the release of Apache Mahout 0.2.

Highlights include:

  • Significant performance increase (and API changes) in collaborative filtering engine
  • K-nearest-neighbor and SVD recommenders
  • Much code cleanup, bug fixing
  • Random forests, frequent pattern mining using parallel FP growth
  • Latent Dirichlet Allocation
  • Updates for Hadoop 0.20.x

Details on what’s included can be found in the release notes.

Downloads are available from the Apache Mirrors