Latent semantic indexing (LSI) is an indexing and information retrieval method used to identify patterns in the relationships between terms and concepts.
With LSI, a mathematical technique is used to find semantically related terms within a collection of text (an index) where those relationships might otherwise be hidden (or latent).
And in that context, this sounds like it could be super important for SEO.
If you’ve heard rumblings about latent semantic indexing in SEO or been advised to use LSI keywords, you aren’t alone.
But will LSI actually help improve your search rankings? Let’s take a look.
The Claim: Latent Semantic Indexing As A Ranking Factor
The claim is simple: Optimizing web content using LSI keywords helps Google better understand it and you’ll be rewarded with higher rankings.
Backlinko defines LSI keywords in this way:
“LSI (Latent Semantic Indexing) Keywords are conceptually related terms that search engines use to deeply understand content on a webpage.”
By using contextually related terms, you can deepen Google’s understanding of your content. Or so the story goes.
That resource goes on to make some pretty compelling arguments for LSI keywords:
- “Google relies on LSI keywords to understand content at such a deep level.”
- “LSI Keywords are NOT synonyms. Instead, they’re terms that are closely tied to your target keyword.”
- “Google doesn’t ONLY bold terms that exactly match what you just searched for (in search results). They also bold words and phrases that are similar. Needless to say, these are LSI keywords that you want to sprinkle into your content.”
Does this practice of “sprinkling” terms closely related to your target keyword help improve your rankings via LSI?
The Evidence For LSI As A Ranking Factor
Google identifies relevance as one of five key factors that help it determine which result is the best answer for any given query.
As Google explains in its How Search Works resource:
“To return relevant results for your query, we first need to establish what information you’re looking for – the intent behind your query.”
Once intent has been established:
“…algorithms analyze the content of webpages to assess whether the page contains information that might be relevant to what you are looking for.”
Google goes on to explain that the “most basic signal” of relevance is that the keywords used in the search query appear on the page. That makes sense – if you aren’t using the keywords the searcher is looking for, how could Google tell you’re the best answer?
Now, this is where some believe LSI comes into play.
If using keywords is a signal of relevance, using just the right keywords must be a stronger signal.
There are purpose-built tools dedicated to helping you find these LSI keywords, and believers in this tactic recommend using all kinds of other keyword research tactics to identify them as well.
The Evidence Against LSI As A Ranking Factor
Google’s John Mueller has been crystal clear on this one:
“…we have no concept of LSI keywords. So that’s something you can completely ignore.”
There’s a healthy skepticism in SEO that Google may say things to lead us astray in order to protect the integrity of the algorithm. So let’s dig in here.
First, it’s important to understand what LSI is and where it came from.
Latent semantic indexing emerged in the late 1980s as a methodology for retrieving textual objects from files stored in a computer system. As such, it’s an example of one of the earlier information retrieval (IR) concepts available to programmers.
As computer storage capacity improved and electronically available sets of data grew in size, it became more difficult to locate exactly what one was looking for in that collection.
Researchers described the problem they were trying to solve in a patent application filed September 15, 1988:
“Most systems still require a user or provider of information to specify explicit relationships and links between data objects or text objects, thereby making the systems tedious to use or to apply to large, heterogeneous computer information files whose content may be unfamiliar to the user.”
Keyword matching was being used in IR at the time, but its limitations were evident long before Google came along.
Too often, the words a person used to search for the information they sought were not exact matches for the words used in the indexed information.
There are two reasons for this:
- Synonymy: the diverse range of words used to describe a single object or idea results in relevant results being missed.
- Polysemy: the different meanings of a single word result in irrelevant results being retrieved.
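To make the two failure modes concrete, here is a minimal Python sketch of exact keyword matching. The documents, the queries, and the `keyword_match` helper are all hypothetical and exist only for illustration; this is not how any particular search system is implemented.

```python
# Toy illustration of exact keyword matching.
# The documents and queries below are made up for this example.
docs = {
    "d1": "cheap automobile repair near the city",
    "d2": "car maintenance tips for winter",
    "d3": "river bank erosion and flooding",
    "d4": "open a savings account at the bank",
}

def keyword_match(query, docs):
    """Return the ids of documents that contain every query word verbatim."""
    terms = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if all(term in text.lower().split() for term in terms)
    )

# Synonymy: d1 is about cars but says "automobile", so it is missed.
synonymy_hits = keyword_match("car", docs)

# Polysemy: "bank" matches both the river sense and the finance sense.
polysemy_hits = keyword_match("bank", docs)

print(synonymy_hits)
print(polysemy_hits)
```

A searcher looking for car repair never sees the automobile repair page, and a searcher looking for a financial institution also gets a page about river erosion.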
These are still issues today, and you can imagine what a massive headache it is for Google.
However, the methodologies and technology Google uses to solve for relevance long ago moved on from LSI.
What LSI did was automatically create a “semantic space” for information retrieval.
As the patent explains, LSI treated the unreliability of association data (the mismatch between the words searchers use and the words that appear in documents) as a statistical problem.
Without getting too into the weeds, these researchers essentially believed that there was a hidden underlying latent semantic structure they could tease out of word usage data.
Doing so would reveal the latent meaning and enable the system to bring back more relevant results – and only the most relevant results – even if there’s no exact keyword match.
Here’s what that LSI process actually looks like:
Image created by author, January 2022
And here’s the most important thing you should note about the above illustration of this methodology from the patent application: there are two separate processes happening.
First, the collection or index undergoes Latent Semantic Analysis.
Second, the query is analyzed and the already-processed index is then searched for similarities.
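The two processes can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the patent’s implementation: it assumes the common formulation of LSA as a truncated singular value decomposition of a term-document count matrix, and the five-document collection, the choice of `k = 2`, and the helper names `to_vector` and `search` are all hypothetical.

```python
import numpy as np

# Hypothetical toy collection; a real index would be far larger.
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "river bank flood",
    "river flood water",
]

# Process 1: analyze the collection once to build the semantic space.
vocab = sorted({word for doc in docs for word in doc.split()})
row = {word: i for i, word in enumerate(vocab)}

def to_vector(text):
    """Count how often each vocabulary term appears in the text."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in row:
            vec[row[word]] += 1.0
    return vec

# Rows are terms, columns are documents.
term_doc = np.column_stack([to_vector(doc) for doc in docs])
U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)

k = 2                             # number of latent dimensions kept (assumed)
U_k = U[:, :k]
doc_vecs = (U_k.T @ term_doc).T   # one k-dimensional row per document

# Process 2: project the query into the same space and compare.
def search(query):
    """Return the cosine similarity between the query and every document."""
    q = U_k.T @ to_vector(query)
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q)
    return doc_vecs @ q / norms

scores = search("car")
for doc, score in zip(docs, scores):
    print(f"{score:.2f}  {doc}")
```

In this toy collection, “automobile engine repair” scores close to the documents that actually contain “car”, while the river documents score near zero, even though the query and that document share no words. Note that the semantic space (`U_k` and `doc_vecs`) is derived from the whole collection, so it has to exist before any query can be answered.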
And that’s where the fundamental problem with LSI as a Google search ranking signal lies.
Google’s index is massive at hundreds of billions of pages, and it’s growing constantly.
Each time a user inputs a query, Google is sorting through its index in a fraction of a second to find the best answer.
Using the above methodology in the algorithm would require that Google:
- Recreate that semantic space using LSA across its entire index.
- Analyze the semantic meaning of the query.
- Find all similarities between the semantic meaning of the query and documents in the semantic space created from analyzing the entire index.
- Sort and rank those results.
That’s a gross oversimplification, but the point is that this isn’t a scalable process.
LSI would be super useful for small collections of information. It was helpful for surfacing relevant reports inside a company’s computerized archive of technical documentation, for example.
The patent application illustrates how LSI works using a collection of nine documents. That’s what it was designed to do. LSI is primitive in terms of computerized information retrieval.
Latent Semantic Indexing As A Ranking Factor: Our Verdict
While the underlying principles of eliminating noise by determining semantic relevance have surely informed developments in search ranking since LSA/LSI was patented, LSI itself has no useful application in SEO today.
It hasn’t been ruled out completely, but there is no evidence that Google has ever used LSI to rank results. And Google definitely isn’t using LSI or LSI keywords today to rank search results.
Those who recommend using LSI keywords are latching on to a concept they don’t quite understand in an effort to explain why the ways in which words are related (or not) are important in SEO.
Relevance and intent are foundational considerations in Google’s search ranking algorithm.
Those are two of the big questions they’re trying to solve for in surfacing the best answer for any query.
Synonymy and polysemy are still major challenges.
Semantics – that is, our understanding of the various meanings of words and how they’re related – is essential in producing more relevant search results.
But LSI has nothing to do with that.
Featured Image: Paulo Bobita/Search Engine Journal