Shalin Says...

Apache Mahout 0.3 has been released. Apache Mahout is a project which attempts to make machine learning both scalable and accessible. It is a sub-project of the excellent Apache Lucene project which provides open source search software.

From the project website:

The Apache Lucene project is pleased to announce the release of Apache Mahout 0.3. Highlights include:
  • New: math and collections modules based on the high performance Colt library
  • Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
  • Parallel Dirichlet process clustering (model-based clustering algorithm)
  • Parallel co-occurrence based recommender
  • Parallel text document to vector conversion using LLR based ngram generation
  • Parallel Lanczos SVD (Singular Value Decomposition) solver
  • Shell scripts for easier running of algorithms, utilities and examples

... and much more: code cleanup, many bug fixes, and performance improvements.

Details on what’s included can be found in the release notes. Downloads are available from the Apache Mirrors.

Congratulations and thanks to all Mahout developers!



Apache Mahout 0.2 has been released. Apache Mahout is a project which attempts to make machine learning both scalable and accessible. It is a sub-project of the excellent Apache Lucene project which provides open source search software.

From the project website:

The Apache Lucene project is pleased to announce the release of Apache Mahout 0.2.

Highlights include:

  • Significant performance increase (and API changes) in collaborative filtering engine
  • K-nearest-neighbor and SVD recommenders
  • Much code cleanup, bug fixing
  • Random forests, frequent pattern mining using parallel FP growth
  • Latent Dirichlet Allocation
  • Updates for Hadoop 0.20.x

Details on what’s included can be found in the release notes.

Downloads are available from the Apache Mirrors.


Apache Mahout 0.1 has been released. Apache Mahout is a project which attempts to make machine learning both scalable and accessible. It is a sub-project of the excellent Apache Lucene project which provides open source search software.

This is also the first public release of the Taste collaborative filtering project since it was donated to Apache Mahout last year.

From the official announcement email:

The Apache Lucene project is pleased to announce the release of Apache Mahout 0.1. Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license. The first public release includes implementations for clustering, classification, collaborative filtering and evolutionary programming.

Highlights include:
  1. Taste Collaborative Filtering
  2. Several distributed clustering implementations: k-Means, Fuzzy k-Means, Dirichlet, Mean-Shift and Canopy
  3. Distributed Naive Bayes and Complementary Naive Bayes classification implementations
  4. Distributed fitness function implementation for the Watchmaker evolutionary programming library
  5. Most implementations are built on top of Apache Hadoop (http://hadoop.apache.org) for scalability
See the announcement for more details: http://www.nabble.com/-ANNOUNCE--Apache-Mahout-0.1-Released-td22937220.html

There is a lot of interest in Mahout from the community, and the project had a successful year with the Google Summer of Code 2008 program. This year again, there have been multiple proposals and I’m sure that great things are on the way.

The Apache Mahout Wiki has a lot of good documentation on the project as well as on machine learning in general. The mailing list is very active and, of course, the project has some great people involved; see the committers page. I would encourage every student interested in machine learning to participate in the project.

I wish good luck to the project and the people involved in it. Keep up the great work!

The Google Summer of Code program is back this year, and Apache is looking for students interested in contributing and earning money through the program.

Last year, the Apache Software Foundation received excellent proposals from quite a few students, who went on to do a lot of great work. Take a look at last year’s proposals to get a feel for the level of competition. I’m sure there will be quite a few this year as well. A wiki page has been put up which will list all the proposals.

You can come up with your own proposals as well and add them to the wiki. However, the ASF being a community-driven ecosystem, it is highly recommended that you drop a line to the project mailing lists and get feedback on your proposal. That way, you will have time to convince one or more committers to sign up as mentors for your proposal. They will help you develop your proposal as well as guide you through the project with regular reviews and feedback. If your proposal attracts no mentors, it cannot be accepted for the program.

Open source is a different ball game from academic projects, and the code itself is only a small part of it. One needs to write unit tests to inspire confidence in the code before it can be incorporated into a project. If other developers are interested in your project, they’ll want to collaborate with you. With each patch, you’ll get review comments which you may need to incorporate. There are very few places, if any, where you can get such great feedback on your work, and it is absolutely free.

Users will need documentation and tutorials about your code before they can start using it. Sometimes, one also needs to create working examples to demonstrate usage and features. Users will ask questions about your features, post bug reports and suggest enhancements. It is the open source way to answer them courteously and guide them to solutions. As the feature matures, the community also benefits from best practices, FAQs and guidelines on performance optimization. Ultimately, it is well worth the effort to learn the open source way of developing software.

I’ve been thinking about a few features which can help Solr, but more on that later. For now, see the announcement on the solr-dev mailing list about GSOC 2009 and reply with your ideas if you are interested.

Grant has also written a useful post on his blog with advice for aspiring GSOC participants.