Matching is a struggle, but not with Elasticsearch

by Greg Dolder, Principal Architect

Project Brief

There is no planet B. So when our sustainability-focused client asked us for help processing data that would lead to more energy-efficient purchases, a reduced carbon footprint and reduced energy grid load, we jumped at the opportunity.

The Problem

Matching and reconciling ERP data for equipment sold by dozens of HVAC, food service and lighting distributors nationwide is no easy feat.

While many distributors sell the same or similar equipment, they tend to use their own naming conventions and formatting when it comes to models, product numbers, SKUs and more.

This posed a substantial hurdle, as all products must resolve to industry-standard identifiers such as those from Energy Star and AHRI. The client's initial approach to matching relied on regular expressions (regex), PostgreSQL pattern matching and similarity queries. While these yielded a decent level of success, the match rate fell well short of what the project needed.
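To make the limits of that approach concrete, here is a minimal sketch of regex normalization combined with a string-similarity score. It uses Python's standard library as a rough stand-in for the PostgreSQL queries, which are not shown in this article; the function names and model strings are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(model: str) -> str:
    """Uppercase the model string and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", model.upper())


def similarity(a: str, b: str) -> float:
    """Score two model strings from 0.0 (nothing in common) to 1.0 (identical)
    after normalization. This is an illustrative metric, not the client's."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


if __name__ == "__main__":
    # Formatting differences disappear after normalization.
    print(similarity("XR16-024A", "xr16 024a"))
    # Unrelated identifiers for the same product cannot be rescued by string rules.
    print(similarity("ABC", "XYZ"))
```

Normalization handles punctuation and casing, but every new distributor convention tends to need another rule, which is where the maintenance cost comes from.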

It was time to look for an alternate solution.

The Data

We initially spent time analyzing the incoming data from all of the disparate sources. The team determined that adding more regular expressions, pattern matching and similarity queries would yield only minimal additional matches. Additionally, maintaining the growing combination of rules would be an unbearable burden for the already lean team.

What about a search solution? Hello Elasticsearch.

The Solution

Our team installed and configured Elasticsearch, then indexed the AHRI and Energy Star data sources. We were then able to tune the level of "fuzziness" to yield an almost 100% success rate in matching equipment to the proper certified ID from the respective certification authority.
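As a rough illustration of what tuning fuzziness can look like, the sketch below builds an Elasticsearch `match` query body with the `fuzziness` option. The field name `model_number` and the default values are assumptions made for this example; the actual mappings and settings used on the project are not described here.

```python
from typing import Any, Dict


def build_fuzzy_query(model: str, fuzziness: str = "AUTO", size: int = 5) -> Dict[str, Any]:
    """Build a search body that tolerates small spelling differences.

    `model_number` is a hypothetical field name. `fuzziness` accepts "AUTO"
    or an edit distance of "0", "1" or "2".
    """
    return {
        "size": size,  # return only the top candidates for review
        "query": {
            "match": {
                "model_number": {
                    "query": model,
                    "fuzziness": fuzziness,
                }
            }
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_fuzzy_query("XR16-024A"), indent=2))
```

The resulting body can be sent to an index's `_search` endpoint. Raising or lowering `fuzziness` trades recall against false positives, so it is worth checking a sample of matches after each change.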

Why not 100%?

Glad you asked. The remaining cases involved ERP configurations whose data could not be matched at the string level. For example, an ERP system would provide a US code as the model, while the certification authority uses the manufacturer's international SKU. As strings these were entirely different, like trying to match ABC to XYZ. These cases could only be matched via manufacturer documentation or websites and some good old-fashioned human intervention. We then indexed these into a manual-match index that complemented the equipment index, achieving a 100% match rate with the combination.
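The combination of the two indexes can be sketched as a two-step lookup. This is an in-memory illustration only: the function name, the dictionary standing in for the manual-match index, and the choice to consult manual matches first are all assumptions, not a description of the production setup.

```python
from typing import Callable, Mapping, Optional


def resolve_certified_id(
    erp_model: str,
    manual_matches: Mapping[str, str],
    fuzzy_lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return a certified ID for an ERP model string, or None if unresolved.

    manual_matches stands in for the manual-match index (keys are uppercased
    ERP model strings); fuzzy_lookup stands in for a fuzzy search against the
    equipment index. Both are hypothetical placeholders.
    """
    key = erp_model.strip().upper()
    if key in manual_matches:
        # A human-verified match takes priority (an assumption of this sketch).
        return manual_matches[key]
    return fuzzy_lookup(erp_model)


if __name__ == "__main__":
    manual = {"ABC": "CERT-XYZ"}
    fuzzy = lambda model: "CERT-123" if model.startswith("XR") else None
    print(resolve_certified_id("abc", manual, fuzzy))
    print(resolve_certified_id("XR16-024A", manual, fuzzy))
```

Anything that falls through both steps is a candidate for human review, after which it can be added to the manual matches.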

This resulted in a dramatic reduction in resources, both human and electronic, lowering the client's carbon footprint while supporting their ethos of sustainability on the only planet we have.

Conclusion

Elasticsearch is like having a super-smart librarian who can quickly find exactly what book you need in a huge library, while regular expressions are like using a basic flashlight to search for the book yourself.

With Elasticsearch, you get a powerful system that understands the context and relevance of your data (even when dissimilar), making it a much faster, more efficient and more accurate way to find what you're looking for (or matching to), especially when dealing with massive amounts of information.

Need help matching your data or finding exactly what you're looking for? We're here to help!
