General-purpose embedding models often struggle to extract domain-specific features from the input and represent them effectively in the embedding. Because they have no guidance on what is important, domain-critical aspects are not prioritised appropriately, which leads to suboptimal results.
For instance, location information might be crucial in one domain but irrelevant in another.
Another important aspect is how relevant a feature is to a particular search query in that domain. A search query can contain multiple features, but only one of them may really matter.
For instance, “park in London with animals”. If we don’t have any documents that match exactly, should we prioritise “London”, “parks” or “animals”? The answer here again depends on what the end-user expects. I would imagine that a tourist searching for parks with animals in London would not expect to see a park in “Dallas” since it is not relevant at all.
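To make this concrete, here is a minimal sketch using sentence-transformers. The model choice and the wording of the three example documents are illustrative assumptions; with a general-purpose model, the Dallas Zoo sentence can easily end up closest to this query.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative general-purpose embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "park in London with animals"
documents = [
    "Hyde Park is a large royal park in central London.",
    "The Science Museum in London has interactive exhibits for the whole family.",
    "The Dallas Zoo is home to elephants, lions and many other animals from Africa.",
]

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(documents, convert_to_tensor=True)

# Cosine similarity between the query and each document
content_scores = util.cos_sim(query_emb, doc_embs)[0]
for doc, score in sorted(zip(documents, content_scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}  {doc}")
```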
Given the situation described above, this is not what the end-user (a tourist in London) would expect, even though the Dallas Zoo sentence is more similar in the vector space.
Common approaches to solving this issue include fine-tuning models, using cross-encoders, or implementing hybrid search. Another, more flexible, approach is to use feature (entity) extraction in combination with multi-vectors.
Feature extraction using NER models
In our example above, the intention is to prioritise location. To do that, we first need to reliably extract location information from both the search query and the documents.
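A minimal sketch using a general-purpose NER model from Hugging Face (the model choice is illustrative; any token-classification model with a location label works the same way):

```python
from transformers import pipeline

# Illustrative general-purpose NER model (CoNLL-style labels: PER, ORG, LOC, MISC)
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

texts = [
    "park in London with animals",
    "Hyde Park is a large royal park in central London.",
    "The Dallas Zoo is home to elephants, lions and many other animals from Africa.",
]

for text in texts:
    # Keep only the entities tagged as locations
    locations = [entity["word"] for entity in ner(text) if entity["entity_group"] == "LOC"]
    print(f"{text} -> {locations}")
```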
The NER model extracts the locations accurately, but it also extracts “Africa” as a location, which is not what we want in this particular case: if a tourist is searching for something in Africa, we don’t want to show them a zoo in Dallas.
Additionally, NER models come with a limited set of feature (entity) categories by default, which means that, without fine-tuning, you might not have everything you need.
Feature extraction using LLMs
Another approach is to use an LLM to extract the important features.
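A minimal sketch using the OpenAI chat API with JSON mode; the model, the prompt and the expected output shown here are illustrative, and any capable LLM could be used instead.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def extract_main_location(text: str) -> dict:
    """Ask the LLM for the main location only, ignoring locations mentioned in passing."""
    prompt = (
        "Extract the main location the text is about. Ignore locations that are only "
        "mentioned in passing (for example, where the animals come from). "
        'Respond with JSON like {"location": "..."} or {"location": null}.\n\n'
        f"Text: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


print(extract_main_location(
    "The Dallas Zoo is home to elephants, lions and many other animals from Africa."
))
# e.g. {"location": "Dallas"}
```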
As we can see, it solved the issue with related locations and focused only on the main location, which is exactly what we want.
Multi-vector approach
Now, using the extracted locations, let’s see how the documents rank.
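A minimal sketch, assuming “London” was extracted from the query, “London”, “London” and “Dallas” were extracted from the three example documents, and the locations are embedded with the same model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

query_location = "London"                       # extracted from the search query
doc_locations = ["London", "London", "Dallas"]  # extracted from the three documents

location_scores = util.cos_sim(
    model.encode(query_location, convert_to_tensor=True),
    model.encode(doc_locations, convert_to_tensor=True),
)[0]

for location, score in zip(doc_locations, location_scores.tolist()):
    print(f"{score:.4f}  {location}")
```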
And now let’s combine the scores accordingly.
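A self-contained sketch of the combination. Equal weights for the full-document score and the location score are an assumption here; the real weights are a tuning decision, so the exact numbers it prints will differ depending on the model and weights.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "park in London with animals"
documents = [
    "Hyde Park is a large royal park in central London.",
    "The Science Museum in London has interactive exhibits for the whole family.",
    "The Dallas Zoo is home to elephants, lions and many other animals from Africa.",
]
query_location = "London"
doc_locations = ["London", "London", "Dallas"]

content_scores = util.cos_sim(
    model.encode(query, convert_to_tensor=True),
    model.encode(documents, convert_to_tensor=True),
)[0]
location_scores = util.cos_sim(
    model.encode(query_location, convert_to_tensor=True),
    model.encode(doc_locations, convert_to_tensor=True),
)[0]

# Equal weighting of the content vector and the location vector (an assumption)
combined_scores = 0.5 * content_scores + 0.5 * location_scores

for doc, score in sorted(zip(documents, combined_scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}  {doc}")
```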
As we can see, we have achieved the desired result: Hyde Park in London is the first result, followed by the Science Museum in London, with the Dallas Zoo coming third with a combined score of 0.6655 (quite far behind the first two results).
This is a simplified example, and the scores could also be adjusted using feature weights. For instance, the location could be boosted or reduced depending on the search query. We could also assign relative weights to each feature for the document itself and use them at the time of search to increase or decrease the final score.
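As a rough sketch of what such weighting could look like (the scheme, names and numbers below are purely illustrative):

```python
# Query-side feature weights: boost the location for location-heavy queries.
feature_weights = {"content": 0.4, "location": 0.6}

# Optional document-side importance per feature,
# stored with the document and applied at search time.
doc_feature_importance = {"content": 1.0, "location": 1.0}


def combined_score(feature_scores: dict[str, float]) -> float:
    """Weighted sum of the per-feature similarity scores."""
    return sum(
        weight * doc_feature_importance.get(feature, 1.0) * feature_scores[feature]
        for feature, weight in feature_weights.items()
    )


print(combined_score({"content": 0.55, "location": 0.97}))
```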
One concern with this approach is LLM latency at search time, and it is a valid one. If that latency is not acceptable, an option is to skip feature extraction at search time.
In that case, the raw query embedding is compared against the location embeddings; the scores still largely represent the expected relevance, but in some situations this can lead to unexpected results. This is where fine-tuning a NER model for search-query feature extraction can be a good approach, and LLMs can help generate a good training data set.
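As a rough sketch, an LLM could generate annotated search queries that are later converted into a NER fine-tuning data set (the model, prompt and output format are illustrative):

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_training_queries(n: int = 20) -> list[dict]:
    """Ask an LLM for synthetic search queries with the main location annotated."""
    prompt = (
        f"Generate {n} short, realistic travel-related search queries. "
        "For each query, return the main location the user cares about, "
        "or null if there is none. "
        'Respond with JSON like {"examples": [{"query": "...", "location": "..."}]}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["examples"]


for example in generate_training_queries(5):
    print(example)
```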
Final thoughts
The concept of defining a feature schema for your documents and also applying it to your search queries is a simple yet powerful way to improve semantic search relevance. It can also be combined with hybrid search (even for filtering), and it makes it possible to use per-field (feature) embedding models if necessary.
I'm not a machine learning engineer, and this post is the result of my experience building semantic search systems. If you have any great suggestions, please let me know on x.com/aivisSilins
Thanks for reading!