Yandex’s Artificial Intelligence & Machine Learning Algorithms via @TaylorDanRW
Earlier this month, Google unveiled its latest AI algorithm, BERT, which is said to be the biggest Google update since RankBrain and affects 10% of all search queries.
BERT stands for Bidirectional Encoder Representations from Transformers. Transformers are models that process each word in relation to all the other words in a sentence, rather than one by one in order, which helps them account for context such as juxtaposed keywords and synonyms.
However, Google’s artificial intelligence and machine learning algorithms aren’t the only ones being used by search engines globally.
Machine learning is a blanket term encompassing a wide range of algorithms that learn from datasets to provide:
It is widely used for a number of tasks, not only by search engines, but also for:
- Music and film recommendations on streaming platforms.
- Energy usage predictions across states.
Search engines use machine learning to process data from across the internet, and some offline sources in the case of Yandex, to provide better search results and experiences for users.
It’s been a decade since Yandex first introduced machine learning in search with the launch of Matrixnet.
The search engine has since gone on to improve its AI and ML capabilities with further updates including Palekh and Korolyov.
Matrixnet works by taking thousands of variables and “ranking factors” and assigning different weights to them based on:
- The user location.
- The search query.
- Established user intent(s).
This is done in order to return more relevant and accurate results to the user.
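The idea of weighting the same ranking factors differently depending on context can be sketched in a few lines. This is a toy illustration only: the factor names, the weights, the two intent labels and the linear scoring are all assumptions made for the example and are not taken from Matrixnet itself.

```python
from typing import Dict, List

# Hypothetical weights: the same factors count differently per user intent.
WEIGHTS_BY_INTENT: Dict[str, Dict[str, float]] = {
    "informational": {"text_relevance": 0.7, "local_match": 0.1, "commercial_signals": 0.2},
    "transactional": {"text_relevance": 0.3, "local_match": 0.3, "commercial_signals": 0.4},
}


def score_page(page: dict, intent: str, user_region: str) -> float:
    """Weighted sum of ranking factors, with weights chosen by intent."""
    weights = WEIGHTS_BY_INTENT[intent]
    factors = dict(page["factors"])
    # The location factor depends on where the user is searching from.
    factors["local_match"] = 1.0 if page["region"] == user_region else 0.0
    return sum(weight * factors[name] for name, weight in weights.items())


def rank(pages: List[dict], intent: str, user_region: str) -> List[str]:
    """Return page names ordered from highest to lowest score."""
    ordered = sorted(pages, key=lambda p: score_page(p, intent, user_region), reverse=True)
    return [p["name"] for p in ordered]


# Two made-up pages: an informational guide and a local shop.
PAGES = [
    {"name": "guide", "region": "moscow",
     "factors": {"text_relevance": 0.9, "commercial_signals": 0.1}},
    {"name": "shop", "region": "vladivostok",
     "factors": {"text_relevance": 0.5, "commercial_signals": 0.9}},
]

print(rank(PAGES, "informational", "vladivostok"))
print(rank(PAGES, "transactional", "vladivostok"))
```

With these made-up numbers, the same two pages swap places when the assumed intent changes, which is the behavior the paragraph above describes.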
A tangible impact of Matrixnet was that for shorter queries with multiple common interpretations, non-commercial content began to feature more prominently within search results pages versus more commercial content (and commercial websites).
This is because the new core algorithm began to take into account the domain as a whole ecosystem, rather than individual pages and their immediate links.
During the same period that Yandex launched Matrixnet, the search engine also took measures to provide better results for users based on location. (There is no value for someone in Vladivostok being given local results for Moscow as it’s 113 hours by car!)
It did this through the Arzamas algorithm, which was superseded that same year by Snezhinsk and then, in 2010, by Obninsk.
The latter enabled Yandex to better understand the region a website was based in, even if the webmasters hadn’t made the region declaration in Yandex Webmaster Tools.
This notably impacted websites with location doorway pages and local citation spam.
In 2016 (a year after RankBrain), Yandex introduced the Palekh algorithm. Palekh made use of deep neural networks to understand the meaning behind a search query better.
The algorithm uses neural networks to see the connections between a query and a document even if they don’t contain common words.
This technology is most useful for complex queries, such as finding movies by inaccurate descriptions of their plots.
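To see how a query and a document can match without sharing any words, consider the rough sketch below. It compares averaged word vectors using cosine similarity. The three-dimensional vectors are hand-written purely for illustration and stand in for what a trained neural network would produce; nothing here reflects Palekh's real model or data.

```python
import math
from typing import List

# Toy word vectors, invented for this example (not learned from data).
TOY_VECTORS = {
    "film": [0.9, 0.1, 0.0],
    "movie": [0.85, 0.15, 0.0],
    "ship": [0.1, 0.9, 0.0],
    "boat": [0.15, 0.85, 0.0],
    "sinks": [0.1, 0.8, 0.1],
    "iceberg": [0.0, 0.7, 0.3],
    "recipe": [0.0, 0.0, 1.0],
    "soup": [0.0, 0.1, 0.9],
}


def embed(text: str) -> List[float]:
    """Average the vectors of the known words in the text."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


query = "movie boat sinks"
related_doc = "film ship iceberg"    # shares no words with the query
unrelated_doc = "soup recipe"

print(cosine(embed(query), embed(related_doc)))
print(cosine(embed(query), embed(unrelated_doc)))
```

The related document scores far higher than the unrelated one even though neither contains a single word from the query.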
Building on the Palekh algorithm, Yandex released the Korolyov update in August 2017.
According to Andrey Styskin, the Head of Yandex Search:
“Korolyov is able to match the meaning of a query with the meaning of pages, as opposed to the way Palekh used to work with headlines only. It also improves off the 150 pages Palekh was analyzing, by its ability to work with 200 000 pages at once.”
Similar to how RankBrain works, Korolyov becomes more efficient and accurate with each incremental data point it receives, and all results then feed back into the core algorithm, Matrixnet.
At the same time as the Korolyov announcement, Yandex also announced that Matrixnet had begun to:
- Take into account data from their crowdsourcing platform, Toloka (imagine a version of Amazon’s Mechanical Turk).
- Process larger amounts of anonymized user data, to further improve and vary the data sets to which the machine learning algorithms were exposed.
Korolyov also introduced the notion of semantic (context) vectors within search, allowing it to perform a “meaning analysis” when a user submits a query. This enabled search to take into account the perceived meaning of all queries that led users to certain pages.
This meant that:
- During the indexing phase, each page was converted into semantic/context vectors.
- New queries could be understood more quickly and efficiently, with more accurate results, so as not to provide a negative search experience.
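The split between work done at indexing time and work done at query time can be sketched as follows. This is a simplified illustration under several assumptions: the vectors are passed in as plain lists (the encoder that would produce them is out of scope), the class name `VectorIndex` is hypothetical, and blending a page's content vector with the vectors of past queries by simple averaging is an arbitrary choice made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def mean(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


class VectorIndex:
    """Hypothetical index that stores one precomputed vector per page."""

    def __init__(self) -> None:
        self._pages: Dict[str, Vector] = {}

    def add_page(self, url: str, content_vector: Vector,
                 past_query_vectors: Sequence[Vector] = ()) -> None:
        # Indexing phase: done once per page, before any search happens.
        combined = mean([content_vector, *past_query_vectors])
        self._pages[url] = normalize(combined)

    def search(self, query_vector: Vector, top_k: int = 3) -> List[Tuple[str, float]]:
        # Query phase: only the query is processed; page vectors are reused.
        q = normalize(query_vector)
        scored = [(url, sum(a * b for a, b in zip(q, vec)))
                  for url, vec in self._pages.items()]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


index = VectorIndex()
index.add_page("example.com/a", [1.0, 0.0], past_query_vectors=[[0.8, 0.2]])
index.add_page("example.com/b", [0.0, 1.0])
print(index.search([0.9, 0.1], top_k=1))
```

Because the page vectors are computed ahead of time, answering a new query only requires one vector for the query and a cheap comparison against the stored vectors.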
In 2017, Yandex introduced the successor to the Matrixnet machine learning algorithm, CatBoost.
In comparison to Matrixnet, CatBoost (which is open-sourced) is capable of:
- More accurate predictions.
- Greater results diversification.
- Supporting variables that are non-numerical, such as the types of clouds, breeds of cats, and species of plants.
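One common way to feed non-numerical variables to a tree-based model is to replace each category with a statistic of the target values seen for that category. The sketch below shows an "ordered" variant, where each row is encoded using only the rows that came before it, so a row never sees its own label. It illustrates the general idea and is not a copy of CatBoost's implementation; the function name, the `prior_weight` parameter and the sample data are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def ordered_target_encode(categories: Sequence[str], targets: Sequence[float],
                          prior: float, prior_weight: float = 1.0) -> List[float]:
    """Encode each category using target values from earlier rows only."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    encoded: List[float] = []
    for category, target in zip(categories, targets):
        # Smoothed mean of the targets seen so far for this category.
        value = (sums[category] + prior_weight * prior) / (counts[category] + prior_weight)
        encoded.append(value)
        # Update the running statistics only after encoding the row.
        sums[category] += target
        counts[category] += 1
    return encoded


# Made-up data: cat breeds and a binary label.
breeds = ["siamese", "persian", "siamese", "siamese", "persian"]
labels = [1, 0, 1, 0, 0]
prior = sum(labels) / len(labels)  # overall mean of the label

print(ordered_target_encode(breeds, labels, prior))
```

The first time a breed appears it simply receives the prior; later rows drift toward the average label observed for that breed so far.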
CatBoost uses the machine learning technique known as gradient boosting, which builds an ensemble of decision trees one after another, and is typically applied to regression and classification problems.
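For readers unfamiliar with gradient boosting, here is a bare-bones sketch of the technique for a regression problem with a single numeric feature. Each round fits a one-split decision tree (a "stump") to the errors left by the previous rounds. CatBoost itself is considerably more sophisticated; the function names, the learning rate, the number of rounds and the sample data below are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, left_value, right_value)


def fit_stump(xs: Sequence[float], residuals: Sequence[float]) -> Stump:
    """Find the single split that minimizes squared error on the residuals."""
    best_error, best_stump = None, None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best_error is None or error < best_error:
            best_error, best_stump = error, (threshold, left_value, right_value)
    if best_stump is None:  # every x is identical, so no split is possible
        average = sum(residuals) / len(residuals)
        return (xs[0], average, average)
    return best_stump


def predict_stump(stump: Stump, x: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs: Sequence[float], ys: Sequence[float],
                 n_rounds: int = 50, learning_rate: float = 0.3):
    """Gradient boosting for squared error: each stump fits the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x: float) -> float:
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = fit_boosting(xs, ys)
print(predict(model, 1.5), predict(model, 5.5))
```

The small learning rate means each tree only nudges the prediction, which is one of the ways boosting implementations limit overfitting.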
To date, CatBoost is also used outside of Yandex’s search engine by organizations such as Cloudflare and CERN.
It is utilized where gradient boosting on decision trees is required with reduced risk of overfitting, for tasks such as combatting bot-powered credential stuffing.
Optimizing for Yandex’s AI Algorithms
Yandex’s machine learning algorithms are only a small subset of the updates that the search engine has made over the years to tackle link spam and poor-quality content, much as Google has done.
As with Google’s RankBrain (and now BERT), there is no real way to directly optimize for machine learning algorithms, as they take into account the web as a whole.
As ever, it’s important that you produce content that adds value to the user, matches search intent, and is written in natural language for humans, not machines.