Itamar Syn-Hershko

This Week at Skills Matter: 03 – 07 March 2014

Here’s what’s coming up at Skills Matter this week!


Monday:

Peter Ledbrook will be highlighting the importance of the build tool Gradle and helping us appreciate the challenges of building software, even if you aren’t persuaded by Gradle itself. It is an interesting talk from one of the core Grails committers, who has been developing with Grails since version 0.2.

Tuesday:

London Titanium will be joined by Louis Quaintance and Chris Gedrim, demonstrating how to quickly write and run unit tests utilising the power of Node.js, while taking a look into First Utility’s implementation of Continuous Integration and Continuous Delivery.

The Scala User Group are returning with talks based around Snowplow, an open source event analytics platform. The founder of Snowplow Analytics, Alex Dean, will be delivering a talk about building Snowplow on top of Scala and key libraries and frameworks. Featuring a number of lightning talks, this event is a must-attend for all Scala enthusiasts!

Wednesday:

Jakub Korab will be discussing the impact of common integration failures at this week’s London Java Community Meetup. He will be introducing the tools that Apache Camel gives you out of the box, which take away the burden of the plumbing and free you up to build rock-solid integrations. Jakub, the co-author of the Apache Camel Developer’s Cookbook, will also be officially launching the Cookbook, which contains over 100 recipes for working with Apache Camel.

Join Itamar Syn-Hershko, a core developer of RavenDB, who will be talking about the design and development of applications that provide multi-lingual search, with a run-through of Elasticsearch and Lucene. He will explore the opportunities for using these powerful technologies, sharing his experience and skills to help you understand them in more depth.

Thursday:

Women in Data are back and will be talking about AnnoMarket, the new cloud-based marketplace for text analytics services. This is a hands-on session that delves into the platform at an introductory walk-through level, giving delegates the chance to run the process themselves and discuss the results.

London Java Community are here again with Nitsan Wakart for an in-depth examination of the optimisation process focused on the humble Single-Producer-Single-Consumer queue. Anyone with a passion for concurrency, performance or both should definitely attend this fantastic event.

Guest post: “Sometimes it just makes sense to stop fighting reality and use a set of tools that is more suited for [the] task”, Itamar Syn-Hershko

This is a guest post from Itamar Syn-Hershko – a core developer of RavenDB. The author of open-source projects like HebMorph and NAppUpdate, a committer to the Apache Lucene.NET project and an active participant in others, Itamar strongly believes in the power of open-source projects and the creativity they can bring to the table.

He will be giving an In The Brain Talk on Approaches to multi-lingual text search with Elasticsearch and Lucene on the 5th March in London. You can register for this free talk here.


Let’s face it – not all the data we handle is easy to query. In fact, most of it is actually pretty tough to work with. This is oftentimes because a lot of the data we process and handle is unstructured. Be it logs, archived documents, user data, or text fields in our database that we know contain information that can be useful, but we just don’t know how to get to it.

As developers, we tend to fight that. Our first reaction will always be to try and structure the unstructured. This is the challenge we like to rise to as professionals, and that is truly great. But sometimes it just makes sense to stop fighting reality and use a set of tools that is more suited for this task. In some cases this will save many resources and a lot of hair-pulling. In other cases (I don’t know whether they are better or worse) we never even realized we had a gold mine of information at our fingertips, so we never tried doing anything with it.

During the past 10 years or so the field of information retrieval – text retrieval and search engines in particular – has evolved greatly. Search engines have been built and scaled, and within a few years did what had seemed impossible. Nobody thought we could handle that scale of data, or make sense of it all. Would you have invested in Google before 2000?

Search engines do not exist only on third-party websites like Google or Bing. Quite a few search engine libraries that are meant to be used in both open- and closed-source projects have been released under various licenses. The most notable of all is probably Apache Lucene, a search engine library first released as open source in 1999. Since then, Lucene has made giant strides and is actively developed to this day, reaching new milestones every few months with new features or major improvements.

But Lucene is just a search library. To scale it out so it can handle large amounts of data you need inter-server communication and some logic to split your data between servers. For that, the Lucene project offers Solr, a search server that acts as a wrapper around Lucene indexes. Another option, created by other Lucene project members, is Elasticsearch. Both Solr and Elasticsearch are released under the same open-source license as Lucene. My personal favorite is Elasticsearch, due to its novel approach to scaling out indexes and its super-easy-to-use API (everything is doable using REST calls over HTTP).
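To make the "REST calls over HTTP" point a little more concrete, here is a minimal Python sketch that builds (but does not send) a full-text search request. The index name `articles`, the field name `body` and the `localhost:9200` address are assumptions made up for this example; the `_search` endpoint and the `match` query come from Elasticsearch's query DSL.

```python
import json
from urllib.request import Request


def build_search_request(base_url, index, field, text):
    """Build an HTTP request carrying a full-text `match` query.

    The request is only constructed here, not sent. To execute it, pass it
    to urllib.request.urlopen() while a cluster is running at base_url.
    """
    url = f"{base_url}/{index}/_search"
    body = {"query": {"match": {field: text}}}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index and field names, assumed for illustration only.
request = build_search_request(
    "http://localhost:9200", "articles", "body", "multi-lingual search"
)
```

Because the whole query is plain JSON over HTTP, the same request could just as well be issued from curl or from any other language with an HTTP client.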

Using these technologies (Lucene, Solr or Elasticsearch) it is very easy to add full-text search capabilities to any type of application – running on the desktop, web, cloud or mobile. There are a few things to figure out – like how to feed the data from your data sources, how to make sure the search engine has the latest version of your data at all times, and how to process it correctly so common searches are effective and perform well. Every project has different best practices for those challenges; they are hardly ever the same. But once you have figured those out, browsing your data is suddenly a breeze.

As it turns out, full-text search capabilities are only the tip of the iceberg. As people started using search technologies to perform full-text searches, new capabilities came about. Leveraging the data and insights search engines can provide on our data, we can do a lot of interesting stuff. For example, we can detect typos and offer corrections; we can find similar documents so we can remove or merge them (also known as record linkage); or we can use this to offer customers at our shop similar products they can add to their cart.
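As a rough illustration of the typo-correction idea, the sketch below compares a query term against a small vocabulary of indexed terms using Levenshtein edit distance and suggests the closest one. This brute-force comparison is a stand-in chosen for clarity, not how Lucene or Elasticsearch implement suggestions; the vocabulary and the `max_distance` threshold are assumptions for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def suggest(term, vocabulary, max_distance=2):
    """Return the closest known term within max_distance edits, else None."""
    if term in vocabulary:
        return term
    if not vocabulary:
        return None
    distance, best = min((edit_distance(term, word), word) for word in vocabulary)
    return best if distance <= max_distance else None


# A made-up vocabulary standing in for the terms stored in an index.
vocabulary = {"lucene", "elasticsearch", "solr", "search"}
correction = suggest("lucine", vocabulary)
```

The same distance measure could be used to flag near-duplicate titles or names as candidates for merging, although a real system would need something far faster than comparing against every term.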

Other, more advanced, modern usages of search technologies that are worth noting include geo-spatial search (using shapes like points, circles or polygons representing locations on Earth to find data tagged with more shapes; for example finding the nearest restaurant to the user’s location), image search by color scheme, entity extraction and other Natural Language Processing methods to further analyze texts and improve insights on them.
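The nearest-restaurant example can be sketched in a few lines of Python. This version simply computes the great-circle (haversine) distance from the user to every candidate and picks the smallest; a real search engine would use a spatial index rather than scanning every point. The restaurant names and coordinates are invented for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(user, places):
    """Return the place closest to the user's (lat, lon) position."""
    return min(
        places,
        key=lambda place: haversine_km(user[0], user[1], place["lat"], place["lon"]),
    )


# Invented sample data: two points in London and one in Paris.
restaurants = [
    {"name": "Cafe A", "lat": 51.5200, "lon": -0.1000},
    {"name": "Cafe B", "lat": 51.5300, "lon": -0.1200},
    {"name": "Cafe C", "lat": 48.8566, "lon": 2.3522},
]
user_location = (51.5210, -0.1010)
closest = nearest(user_location, restaurants)
```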

There is a great set of tools at our disposal when using search technologies, far more than we can even list in this blog post. Nowadays this is not only about full-text search anymore (although that is obviously still supported and is better than ever before!). Being familiar with those tools and with the best practices for using them, we can start giving thought to how we can use them in our projects – whether in an automated process or exposed via some UI to our users to give them (and us!) added value.

Modern search engines are built to be scalable and performant. With correct planning you can handle large amounts of data easily (even BigData, if you don’t mind the buzzword), as well as many concurrent users issuing many requests, by spreading your data across multiple servers. Because they are so performant, they can offer real-time search capabilities even on large sets of data. The most impressive example of this is probably Kibana, Elasticsearch’s dashboard, which plots graphs in real time from an intensive stream of raw data such as Apache HTTP server logs.
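One common way to spread data across servers is to hash each document's id and take the result modulo the number of shards, so every document has exactly one home and any node can work out where it lives. The Python sketch below shows that idea only; the SHA-256 hash, the `log-N` ids and the shard count are assumptions for the example, not the hash function or defaults that Elasticsearch itself uses.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard number in the range [0, num_shards)."""
    # SHA-256 is a stand-in hash: stable across runs, unlike Python's hash().
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(doc_ids, num_shards):
    """Group document ids by the shard they are routed to."""
    shards = {shard: [] for shard in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


# Hypothetical document ids, e.g. one per log line.
doc_ids = [f"log-{n}" for n in range(1000)]
shards = distribute(doc_ids, num_shards=3)
```

A search then fans out to every shard and merges the partial results, while a lookup by id only needs to visit the one shard that `shard_for` points to. The catch with this scheme is that changing the number of shards changes where documents land, which is why the shard count needs to be planned up front.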

The field of search engines and information retrieval is moving ahead very fast. There are still many challenges to tackle, but there’s already a lot to gain from this quickly evolving set of technologies. Just a quick look at recent history will show you companies that were sold for billions not because they had a great product, but because they were able to collect a lot of data and extract insights out of it.


Itamar will be exploring and teaching this topic in depth in his 2-day course - available in both London and New York. In the course you will learn mostly about Elasticsearch, and by learning some of the theory behind it and then digging into its core you will understand how to use it correctly, how to bend it to your needs, and what tools are at your disposal. The course is designed to give you what you need to get started with search technologies right away.

Find out more about the course here, or contact a member of the Skills Matter team via email or by calling +44 (0)207 183 9040.

The Year of the RavenDB @skillsmatter

NOSQL has been gathering momentum at Skills Matter in recent years, with user groups springing up for MongoDB, Cassandra, Hadoop and Neo4j, experts sharing their skills in our In the Brain talks, and the first-ever NOSQL eXchange, which we hosted last year.

One of the most frequently asked questions before the NOSQL eXchange was “Why isn’t there a talk on RavenDB?”.

For the uninitiated, RavenDB is the transactional, open-source Document Database written in .NET, created by Oren Eini, aka Ayende Rahien, the man also behind Rhino Mocks and a leading figure in the open source NHibernate project.  Ayende Rahien gave the Skills Matter community An Introduction to RavenDB last year, but if you missed it you can watch the skillscast video here: http://skillsmatter.com/podcast/open-source-dot-net/nhibernate-1801/js-3616.

2012 is undoubtedly going to be the year of the Raven at Skills Matter — this is a database that is attracting a lot of attention, so we are thrilled to be working closely with Ayende to share it with the Skills Matter community.

Core RavenDB developer Itamar Syn-Hershko flies in from Israel on February 28 to give a free talk for the community, explaining RavenDB indexes.  Itamar will walk through the RavenDB indexing process so you can grok it and, while we’re at it, master techniques with frightful names like Map/Reduce, MultiMap, Live Projections, Full text search, Boosting and more.  This is a great opportunity to learn from one of the best minds in RavenDB today.  If you’re not in London, or just can’t make it along, you can receive an email the day after Itamar’s talk with a link to where you can watch the skillscast video online.

If you’re keen to save time and effort building database-backed applications, you can now learn from the experts and get essential skills in using RavenDB at Skills Matter.  Itamar Syn-Hershko will be delivering Ayende Rahien’s RavenDB Workshop, starting on February 28 and running over two days.  If you are a Database Administrator, .NET Developer, Team Leader or Architect, join Itamar to learn to build a practical application which will demonstrate the all-important data management patterns.

RavenDB is definitely going to be one to watch this year — and there will undoubtedly be more events at Skills Matter.  Are you signed up for our newsletters?  If not, take a look — they are updated with new topics and are another great way to hear about upcoming events.