Within the deep learning framework, embedding technology is applied very widely, especially in internet businesses whose main use cases are recommendation, advertising, and search. It is not an exaggeration to call embedding the “basic core operation” of deep learning.
The embedding operation has been mentioned many times in the previous chapters. Its main function is to convert sparse vectors into dense vectors for further processing by the upper-layer deep neural network. However, the role of embedding technology is far more than that. Its application scenarios are very diverse, and the implementation methods are also different.
In academia, embedding technology itself, as a popular direction in the field of deep learning research, has experienced a rapid evolution from processing sequence samples, to processing graph samples, and then to processing heterogeneous multifeature samples. In industry, embedding technology has almost become the most widely used deep learning technology due to its ability to integrate information and easy online deployment. The introduction to embedding technology in this chapter will focus on the following aspects:
(1) Introduction to the basics of the embedding concept;
(2) Introduction to the evolution of embedding methods from the classic Word2vec, to the popular graph embedding, and then to the multifeature fusion embedding technology;
(3) Introduction to the specific application of embedding technology in recommender systems and the method of online deployment and fast inferencing.
4.1 What Is Embedding?
Generally speaking, embedding uses a low-dimensional dense vector to represent an object, where the object can be a word, a product, a movie, and so on. “Represent” here means that the embedding vector can express certain characteristics of the corresponding object, and that the distance between two vectors reflects the similarity between the two objects.
4.1.1 Examples of Word Vectors
The popularity of the embedding method started from research on the problem of word vectors in the field of natural language processing. Here we take the word vector as an example to further explain the meaning of embedding.
Figure 4.1(a) shows the mappings of the embedding vectors of several words (with implicit relationships on genders) encoded by the Word2vec method in the embedding space. It can be seen that the distance vector from Embedding(king) to Embedding(queen) is parallel with that from Embedding(man) to Embedding(woman). This example indicates that the operation between word embedding vectors can even contain semantic-relationship information between words. Similarly, the part-of-speech example shown in Figure 4.1(b) also reflects this feature of word vectors. The distance vectors from Embedding(walking) to Embedding(walked) and Embedding(swimming) to Embedding(swam) are similar, which indicates that the part-of-speech relationship between walking–walked and swimming–swam is similar.

Figure 4.1 Examples of word vectors: (a) male–female; (b) part of speech; (c) country–capital.
Under the premise of a large amount of corpus input, embedding technology can even mine some more general knowledge. As shown in Figure 4.1(c), the operation between embeddings can mine general relational knowledge such as “country–capital.”
From these examples, it is clear that, in the word vector space, even a word vector that is not known in advance can still be inferred from semantic relationships and operations on other word vectors. This is how embedding describes items in a specific vector space and, at the same time, reveals the potential relationships between items. In a sense, the embedding method even has an ontological and philosophical significance.
4.1.2 Expansion of Embedding Application in Other Fields
Since embedding can vectorize words, it can also generate vectorized representations for items in other application domains in a similar way.
For example, if embedding is applied to movie items, the distance between Embedding(The Avengers) and Embedding(Iron Man) should be very close in the embedding vector space, while the distance between Embedding(The Avengers) and Embedding(Gone with the Wind) will be relatively far.
In the same way, if the product is embedded in the e-commerce scenarios, the vector distance between Embedding(keyboard) and Embedding(mouse) should be relatively close, while the distance between Embedding(keyboard) and Embedding(hat) will be relatively far.
Unlike word vectors that use a large text corpus for training, the training samples in different fields are different. For example, video recommendation often uses the user’s streaming sequence to embed movies, while e-commerce platforms use the user’s purchase history as training samples.
4.1.3 Importance of Embedding Technology for Deep Learning Recommender Systems
Going back to the deep learning recommender system, why is embedding technology so important for model learning, or even the “basic core operation” of deep learning recommender systems? There are three main reasons:
(1) In recommendation applications, one-hot encoding is widely used for categorical and ID-type features, resulting in extremely sparse feature vectors, and the structure of deep learning models is ill suited to processing such sparse vectors. Therefore, almost all deep learning recommendation models include an embedding layer that converts high-dimensional sparse feature vectors into low-dimensional dense ones. Mastering various embedding technologies is thus a basic step in building a deep learning recommendation model.
(2) Embedding itself is an extremely important feature vector. Compared with the feature vectors generated by traditional methods such as MF, embedding has stronger expressivity. After the graph embedding technology is introduced, embedding can introduce almost any information for encoding, so that it contains a lot of valuable information. On this basis, the embedding vector is often connected with other recommender systems’ features and then fed into the subsequent deep learning network for training.
(3) Utilizing embeddings to calculate the similarity between an item and a user is a commonly adopted approach in the retrieval layer of recommender systems. With fast nearest-neighbor search techniques such as locality sensitive hashing, embeddings can be used for a rapid “preliminary screening” of massive candidate sets, narrowing them down to hundreds or thousands of items. The filtered candidate list is then handed over to the deep learning network for finer ranking.
Therefore, embedding technology plays an extremely important role in deep learning recommender systems. Familiarity with and mastering various popular embedding methods is a powerful tool for building successful deep learning recommender systems.
4.2 Word2vec: The Classic Embedding Method
When it comes to embedding, we have to mention Word2vec, which made word vectors popular again in the field of natural language processing. More importantly, since Google proposed Word2vec in 2013 [Reference Mikolov1, Reference Mikolov2], embedding technology has spread from natural language processing to many other application fields of deep learning, such as advertising, search, and recommendation, and has become an indispensable technique in the deep learning knowledge framework. As a result, familiarity with Word2vec is crucial to understanding all the embedding-related technologies and concepts.
4.2.1 What Is Word2vec?
Word2vec is short for “word to vector.” As the name suggests, Word2vec is a model that generates a vector representation for words.
In order to train the Word2vec model, a corpus consisting of a set of sentences needs to be prepared. Suppose one of the sentences of length $T$ is $w_1, w_2, \ldots, w_T$, and assume that each word is closely related to its adjacent words; that is, each word is either determined by its adjacent words (the main principle of the continuous bag-of-words (CBOW) model in Figure 4.2) or determines its adjacent words (the main principle of the Skip-gram model in Figure 4.2). As shown in Figure 4.2, the input of the CBOW model is the words around the target word $w_t$, and the predicted output is $w_t$, while Skip-gram is the opposite. Empirically, Skip-gram works better; thus, we will use Skip-gram as the framework to explain the details of the Word2vec model in this section.

Figure 4.2 Structures of two Word2vec models (CBOW and Skip-gram).
4.2.2 Training Process of the Word2vec Model
In order to generate training samples for the model based on the corpus, we select a sliding window with a length of $2c+1$, which includes the $c$ words before and after the target word as well as the target word itself. Then we extract a sentence from the corpus and slide the window over it from left to right. The words in each window position form a training sample.
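To make the sliding-window sampling concrete, here is a minimal Python sketch (not from the original text; the window size and sample sentence are illustrative) that generates Skip-gram (input word, context word) training pairs:

```python
def skipgram_pairs(sentence, c=2):
    """Generate (input_word, context_word) pairs with a window of c words
    on each side of the target word (window length 2c + 1)."""
    pairs = []
    for t, center in enumerate(sentence):
        for j in range(-c, c + 1):
            if j == 0 or t + j < 0 or t + j >= len(sentence):
                continue
            pairs.append((center, sentence[t + j]))
    return pairs

# Example: a toy "corpus" consisting of one tokenized sentence.
print(skipgram_pairs(["the", "quick", "brown", "fox", "jumps"], c=2))
```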
With the training samples generated, it is time to define the optimization objective function. Given that each word $w_t$ determines its adjacent words $w_{t+j}$, and based on the method of maximum likelihood estimation, the training process maximizes the product of the conditional probabilities $p(w_{t+j} \mid w_t)$. With the logarithmic probability applied, the objective function of Word2vec becomes as shown in Eq. 4.1:

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log p\left(w_{t+j} \mid w_t\right) \tag{4.1}$$
The next main problem is how to define the conditional probability $p(w_O \mid w_I)$. Treated as a multiclass classification problem, the most direct method is to use the softmax function. The goal of Word2vec is to use a vector $v_w$ to represent each word $w$ and to use the inner product of any two word vectors to represent their degree of semantic similarity. Then the definition of the conditional probability $p(w_O \mid w_I)$ can be intuitively given, as shown in Eq. 4.2,

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)} \tag{4.2}$$

where $w_O$ represents $w_{t+j}$, which is called the output word, $w_I$ represents $w_t$, which is called the input word, and $W$ is the total number of words in the vocabulary.
Given this conditional probability formula, it is easy to overlook the fact that the input word $w_I$ is used to predict the output word $w_O$ in the Word2vec Skip-gram model, yet the two do not share the same vector representation. As shown in the conditional probability formula, $v'_w$ and $v_w$ are the output vector representation and the input vector representation of word $w$, respectively. So what exactly are the input vector representation and the output vector representation? Here, the neural network structure diagram of Word2vec (shown in Figure 4.3) is used for further explanation.

Figure 4.3 Neural network structure diagram of Word2vec.
According to the definition of the conditional probability $p(w_O \mid w_I)$, the inner product of the two vectors can be put into a softmax to convert it into the neural network structure shown in Figure 4.3. After the model architecture of Word2vec is represented as a neural network, the model parameters can be solved by gradient descent during the training process. The input vector representation corresponds to the weight matrix $W_{V \times N}$ from the input layer to the hidden layer, and the output vector representation corresponds to the weight matrix $W'_{N \times V}$ from the hidden layer to the output layer.
After obtaining the input vector matrix $W_{V \times N}$, each of its rows is a “word vector” in the general sense, so this weight matrix is naturally converted into the lookup table of Word2vec (as shown in Figure 4.4). For example, if the input vector is a one-hot vector over 10 000 words and the hidden layer dimension is 300, then the weight matrix from the input layer to the hidden layer has a dimension of $10\,000 \times 300$. After being converted into a word vector lookup table, the weights in each row become the embedding vector of the corresponding word.
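As a small sketch of the lookup-table idea (the matrix here is random and the sizes follow the 10 000 × 300 example above; this is an illustration, not Word2vec’s actual training output), multiplying a one-hot vector by the input weight matrix simply selects one row, which can therefore be stored and used directly as the word’s embedding:

```python
import numpy as np

vocab_size, hidden_dim = 10_000, 300
W_input = np.random.rand(vocab_size, hidden_dim)   # input-to-hidden weight matrix

word_id = 42
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying by the one-hot vector is equivalent to a row lookup.
assert np.allclose(one_hot @ W_input, W_input[word_id])
embedding = W_input[word_id]                        # the word's embedding vector
```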

Figure 4.4 Lookup table of Word2vec.
4.2.3 Negative Sampling in Word2vec Training
In fact, it is not very feasible to train Word2vec exactly as described in Section 4.2.2, with the original multiclass structure. Assuming that the number of words in the corpus is 10 000, there are 10 000 neurons in the output layer, and updating the weights from the hidden layer to the output layer in each iteration requires calculating the prediction errors of all 10 000 words in the vocabulary [Reference Rong3]. The training system can hardly bear such a huge amount of computation in practice.
In order to reduce the training cost of Word2vec, negative sampling is often adopted. Compared with the original method, which calculates the prediction error of every word in the vocabulary, the negative sampling method only needs to calculate the prediction errors of a few sampled negative words. In this case, the optimization objective of the Word2vec model degenerates from a multiclass classification problem to an approximate binary classification problem [Reference Goldberg and Levy4], as shown in Eq. 4.3,

$$E = -\log \sigma\!\left({v'_{w_O}}^{\top} h\right) - \sum_{w_j \in W_{\text{neg}}} \log \sigma\!\left(-{v'_{w_j}}^{\top} h\right) \tag{4.3}$$

where $\sigma(\cdot)$ is the sigmoid function, $v'_{w_O}$ is the output word vector of the positive sample, $h$ is the hidden layer vector, $W_{\text{neg}}$ is the set of negative samples, and $v'_{w_j}$ is the output vector of a negative-sample word. Since the size of the negative sample set is very limited (usually fewer than ten in practical applications), the computational cost of each gradient descent iteration can be reduced to roughly 1/1000 of the original (assuming a vocabulary size of 10 000).
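The following numpy sketch (an illustration under assumed variable names, not a reference implementation) computes the negative-sampling loss of Eq. 4.3 for one positive word and a handful of sampled negative words:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(h, v_pos, V_neg):
    """h: hidden vector of the input word, shape (d,)
    v_pos: output vector of the positive (observed) word, shape (d,)
    V_neg: output vectors of the sampled negative words, shape (m, d)"""
    pos_term = -np.log(sigmoid(v_pos @ h))
    neg_term = -np.sum(np.log(sigmoid(-(V_neg @ h))))
    return pos_term + neg_term

d = 300
loss = negative_sampling_loss(np.random.rand(d),
                              np.random.rand(d),
                              np.random.rand(5, d))  # 5 negative samples
```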
In fact, the hierarchical softmax method can also be used to speed up the training speed of Word2vec. But the implementation is more complicated, and the final performance is not significantly better than the negative sampling method, so it is less used. Interested readers can refer to Reference [Reference Rong3], which contains details of the hierarchical softmax derivation process.
4.2.4 Importance of Word2vec to Embedding Technology
Word2vec was officially proposed by Google in 2013. In fact, it did not entirely originate from the Google paper. The research on word vectors can be traced back to 2003 [Reference Bengio5], or even earlier. But it was Google’s successful application of Word2vec that allowed word vector technology to be rapidly promoted in the industry, and then made embedding a hot research topic. It is no exaggeration to say that Word2vec is of fundamental significance to research on embedding in the deep learning era.
From another perspective, the model structure, objective function, negative sampling method, and objective function in negative sampling proposed in the research of Word2vec have been reused and optimized many times in subsequent research. Mastering every detail in Word2vec has become the basis for studying embedding. In this sense, mastering the contents of this section is very important.
4.3 Item2vec: The Extension of Word2vec in Recommender Systems
After the birth of Word2vec, the idea of embedding quickly spread from natural language processing to almost all fields of machine learning, including recommender systems. Since Word2vec can embed the words in a word sequence, there should also be a corresponding embedding method for a user’s purchase sequence or streaming sequence. This is the basic idea of the Item2vec model [Reference Barkan and Koenigstein6].
4.3.1 Fundamentals of Item2vec
As mentioned in the matrix factorization section (Section 2.3), the user latent vector and the item latent vector are generated through matrix decomposition. Viewing the matrix factorization model from the perspective of embedding, the user latent vector and the item latent vector are one type of user embedding vector and item embedding vector, respectively. Due to the popularity of Word2vec, more and more embedding methods can be directly used to generate item embedding vectors, while user embedding vectors are more often calculated by averaging or clustering item embeddings in the user’s action history. Using the similarity between the user vector and the item vector, the candidate set can be quickly obtained directly in the retrieval layer of the recommender system, or directly used in the ranking layer to get the final recommendation list. Following this idea, Microsoft proposed a method, Item2vec, to calculate the embedding vector of items in 2016.
Compared with Word2vec, which uses word sequence to generate the embedding vectors, Item2vec utilizes the sequence of actions from a user’s browsing, purchasing, and other histories.
Assuming that a sentence of length $T$ in the Word2vec model is $w_1, w_2, \ldots, w_T$, the loss function can be expressed as shown in Eq. 4.1. Similarly, assuming that the user’s action sequence of length $K$ is $w_1, w_2, \ldots, w_K$, the loss function of Item2vec is then as shown in Eq. 4.4:

$$\frac{1}{K}\sum_{i=1}^{K}\sum_{j \ne i}^{K} \log p\left(w_j \mid w_i\right) \tag{4.4}$$
By comparing the difference between Eqs. 4.1 and 4.4, it can be found that the only difference between Item2vec and Word2vec is that Item2vec abandons the concept of time window and considers that any two items in the sequence are related. Therefore, in the loss function of Item2vec (Eq. 4.4), the loss is the sum of the log probabilities of item pairs, instead of the log probabilities of items within the time window.
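Compared with the Skip-gram sampling sketch in Section 4.2, the only change is in how training pairs are drawn: every two distinct items in a user’s sequence form a pair, with no window restriction. A minimal illustration (the item IDs are made up):

```python
from itertools import permutations

def item2vec_pairs(user_sequence):
    """All ordered (input_item, context_item) pairs -- no time window."""
    return list(permutations(user_sequence, 2))

print(item2vec_pairs(["item_A", "item_B", "item_C"]))
# [('item_A', 'item_B'), ('item_A', 'item_C'), ('item_B', 'item_A'), ...]
```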
After the optimization objective is defined, the remaining training process of Item2vec and the generation of the final item embeddings are consistent with Word2vec; the lookup table of the final item vectors is analogous to the word vector lookup table in Word2vec. Readers can refer to the Word2vec content in Section 4.2 for more training details.
4.3.2 Item2vec in Generalized Form
In fact, there are many more techniques for vectorizing items with embedding than Item2vec alone. Generally speaking, any method capable of generating an item vector can be called Item2vec. A typical example is the two-tower model that has been successfully applied at companies such as Baidu and Facebook (as shown in Figure 4.5).

Figure 4.5 Two-tower model.
In the two-tower model depicted in Figure 4.5, the components on the advertising item side actually implement the generation of item embeddings. Since this model structure is called the “two-tower” model, the structure on the advertising side is also referred to as the “item tower.” The role of the “item tower” is essentially to generate the feature vector of an item: after passing through the multilayer neural network of the item tower, a multidimensional dense vector is produced. From the perspective of embedding, this dense vector is the embedding vector of the item, but the embedding model has changed from Item2vec to a more complex yet more flexible “item tower.” The inputs are one-hot encoded feature vectors based on the item list in user behavior sequences, and the output is a comprehensive item feature vector that can contain more information. The ultimate purpose of both Item2vec and the “item tower” is to convert the original features of an item into a dense embedding representation, so no matter what the model structure is, this kind of model can be called a “generalized” Item2vec model.
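As a rough sketch of what an “item tower” might look like (a hypothetical PyTorch module, not the actual implementation used by Baidu or Facebook), several categorical item features are embedded, concatenated, and passed through a small multilayer network to produce the final item embedding:

```python
import torch
import torch.nn as nn

class ItemTower(nn.Module):
    def __init__(self, feature_dims, embed_dim=16, out_dim=32):
        super().__init__()
        # One embedding table per categorical item feature (ID, category, brand, ...).
        self.embeddings = nn.ModuleList(nn.Embedding(n, embed_dim) for n in feature_dims)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim * len(feature_dims), 64),
            nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, feature_ids):            # feature_ids: (batch, num_features)
        x = torch.cat([emb(feature_ids[:, i]) for i, emb in enumerate(self.embeddings)], dim=1)
        return self.mlp(x)                      # the item embedding vector

tower = ItemTower(feature_dims=[10_000, 200, 500])   # vocabulary sizes are illustrative
item_vec = tower(torch.tensor([[42, 7, 123]]))        # one item's feature IDs
```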
4.3.3 Characteristics and Limitations of Item2vec
As an extension of the Word2vec model, Item2vec can theoretically use any sequence data to generate the embedding vector of an item, which greatly expands the application scenarios of Word2vec. The Item2vec model in a broad sense is actually a general term for item vectorization methods, which can use different deep learning network structures to embed the item features.
The Item2vec method also has its limitations. Since it can only use sequence data, Item2vec is often constrained when dealing with the graph-structured data that is common on the internet, which is why graph embedding technology emerged.
4.4 Graph Embedding: Introducing More Structural Information
Word2vec and its derivative Item2vec are the basic methods of embedding technology, but both are based on sequence samples (such as sentences and user action sequences). In many internet use cases, however, the data is more naturally represented as a graph. Typical examples are the item relationship graph generated from user behavior data (as shown in Figures 4.6(a) and (b)) and the knowledge graph composed of attributes and entities (as shown in Figure 4.6(c)).

Figure 4.6 Item relationship graph and knowledge graph: (a) user behavior sequence; (b) item relation graph; (c) knowledge graph.
When faced with the graph structure, the traditional sequence embedding method seems powerless. In this context, graph embedding represents a new research direction and has gradually become more popular in the field of deep learning recommender systems.
Graph embedding is a method of encoding for nodes in a graph structure. The final node embedding vector generally contains the structural information of the graph as well as the local similarity information of nearby nodes. Different graph embedding methods have different principles and different ways of retaining graph information. The following introduces several state-of-the-art graph embedding methods, and their differences and similarities.
4.4.1 DeepWalk: The Basic Graph Embedding Method
In the early days, the most influential graph embedding method was DeepWalk [Reference Perozzi, Al-Rfou and Skiena7], proposed in 2014. Its main idea is to perform random walks on the graph structure composed of items to generate a large number of item sequences, feed these item sequences into Word2vec as training samples, and finally obtain the item embeddings. DeepWalk can therefore be viewed as a transitional method connecting sequence embedding and graph embedding. Figure 4.7 shows the algorithm flow of DeepWalk as adopted in the paper “Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba” [Reference Wang9].

Figure 4.7 The algorithm flow of DeepWalk: (a) users’ behavior sequences; (b) item graph construction; (c) random walk generation; (d) embedding with Skip-gram.
The process of generating the item embeddings using the DeepWalk method is as follows:
(1) Collect original item sequences that users interact with (Figure 4.7(a));
(2) Build an item relational graph based on these user behavior sequences, as shown in Figure 4.7(b). For example, the edge between items A and B exists because user U1 purchased item A and item B successively. If the same directed edge is generated multiple times, its weight is accumulated. After all user behavior sequences have been converted into edges, the global item relational graph is established;
(3) Randomly select starting points and generate new item sequences by randomly walking on the item relational graph, as shown in Figure 4.7(c);
(4) Feed the item sequences generated here into the Word2vec model – as shown in Figure 4.7(d) – to generate the final item embedding vectors.
In the algorithm flow of DeepWalk, the only thing that needs to be formally defined is the transition probability of the random walk, that is, the probability of traversing to the adjacent node $v_j$ after reaching node $v_i$. If the item relational graph is a directed weighted graph, then the probability of jumping from node $v_i$ to node $v_j$ is defined as shown in Eq. 4.5,

$$P\left(v_j \mid v_i\right) = \begin{cases} \dfrac{M_{ij}}{\displaystyle\sum_{v_k \in N_{+}(v_i)} M_{ik}}, & e_{ij} \in \varepsilon \\ 0, & e_{ij} \notin \varepsilon \end{cases} \tag{4.5}$$

where $\varepsilon$ is the set of all edges in the item relational graph, $N_{+}(v_i)$ is the set of nodes reachable through the outgoing edges of node $v_i$, and $M_{ij}$ is the weight of the edge from node $v_i$ to node $v_j$. In other words, the transition probability of DeepWalk is the ratio of the current edge’s weight to the sum of the weights of all outgoing edges.
If the item relational graph is an undirected and unweighted graph, then the transition probability is a special case of Eq. 4.5: the weight $M_{ij}$ becomes a constant 1, and $N_{+}(v_i)$ is the set of all edges connected to node $v_i$, rather than only the outgoing edges.
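A minimal sketch of a weighted random walk following the transition probability of Eq. 4.5 (the graph representation, function names, and toy graph are assumptions for illustration):

```python
import random

def random_walk(graph, start, walk_length):
    """graph: {node: {neighbor: edge_weight}} for a directed weighted item graph."""
    walk = [start]
    for _ in range(walk_length - 1):
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break
        nodes, weights = zip(*neighbors.items())
        # Jump with probability proportional to the outgoing edge weight (Eq. 4.5).
        walk.append(random.choices(nodes, weights=weights, k=1)[0])
    return walk

item_graph = {"A": {"B": 2, "D": 1}, "B": {"C": 1}, "C": {"A": 1}, "D": {"A": 1}}
sequences = [random_walk(item_graph, random.choice(list(item_graph)), 10) for _ in range(100)]
# `sequences` can then be fed to a Word2vec (Skip-gram) trainer as item "sentences".
```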
4.4.2 Node2vec: Trade-Offs between Homophily and Structural Equivalence
In 2016, researchers from Stanford University went a step further on the basis of DeepWalk and proposed the Node2vec model [Reference Grover and Leskovec8], which enables the graph embedding method to balance between the network homophily and structural equivalence.
Specifically, the “homophily” of the network means that the embeddings of nodes that are close to each other should be as similar as possible. As shown in Figure 4.8, the embedding expressions of node $u$ and its directly connected nodes $s_1$, $s_2$, $s_3$, and $s_4$ should be close to each other, which is the embodiment of network homophily.

Figure 4.8 Schematic diagram of breadth-first search (BFS) and depth-first search (DFS) of the network.
“Structural equivalence” means that the embeddings of structurally similar nodes should be as similar as possible. In Figure 4.8, node $u$ and node $s_6$ are both the central nodes of their respective local networks. They are similar in structure, so their embedding expressions should also be similar. This type of similarity is referred to as structural equivalence.
In order for the graph embedding results to express the “structure” of the network, the random walk needs to be biased toward breadth-first search (BFS). BFS traverses more nodes in the neighborhood of the current node, which amounts to a “local scan” of the network structure around it. Whether the current node is a local central node, a peripheral node, or a bridging node leads to clearly different numbers and orders of nodes in the sequences generated from it, so the final embedding can capture more structural information.
In addition, in order to express the “homophily” of the network, the random walk needs to be biased toward depth-first search (DFS), because DFS is more likely to reach distant nodes through multiple jumps. Nevertheless, such walks still tend to stay within one large cluster, which makes the embeddings of nodes inside the same cluster or community more similar and thus better expresses the network’s “homophily.”
So, in the Node2vec algorithm, how do we control the balance between BFS and DFS? The answer is that it is controlled mainly through the transition probabilities between nodes. Figure 4.9 shows the transition probabilities in Node2vec when the walk has just jumped from node $t$ to node $v$ and is about to jump from node $v$ to one of the surrounding nodes.

Figure 4.9 The illustration of transition probability in Node2vec.
The probability of jumping from node $v$ to the next node $x$ is $\pi_{vx} = \alpha_{pq}(t, x) \cdot \omega_{vx}$, where $\omega_{vx}$ is the weight of the edge connecting nodes $v$ and $x$, and $\alpha_{pq}(t, x)$ is defined as in Eq. 4.6,

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases} \tag{4.6}$$

where $d_{tx}$ is the distance from node $t$ to node $x$, and the parameters $p$ and $q$ jointly control the tendency of the random walk. The parameter $p$ is called the return parameter. The smaller $p$ is, the greater the probability of the walk returning to node $t$, and the more Node2vec emphasizes the structural equivalence of the network. The parameter $q$ is called the in-out parameter. The smaller $q$ is, the more likely the walk is to move toward distant nodes, and the more Node2vec captures the homophily of the network.
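A small sketch of the search bias $\alpha_{pq}(t, x)$ from Eq. 4.6 (the unweighted toy graph and the helper name are illustrative assumptions):

```python
def alpha_pq(graph, t, x, p, q):
    """Search bias for moving from v to x, given the walk came from t (Eq. 4.6).
    graph: {node: set_of_neighbors} for an unweighted, undirected graph."""
    if x == t:                      # d_tx = 0: return to the previous node
        return 1.0 / p
    if x in graph[t]:               # d_tx = 1: x is also a neighbor of t
        return 1.0
    return 1.0 / q                  # d_tx = 2: move further away from t

g = {"t": {"v", "x1"}, "v": {"t", "x1", "x2"}, "x1": {"t", "v"}, "x2": {"v"}}
# Small p favors returning to t (structural equivalence); small q favors x2 (homophily).
biases = {x: alpha_pq(g, "t", x, p=0.5, q=2.0) for x in g["v"]}
```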
Node2vec’s flexibility in balancing network homophily and structural equivalence was validated by experiments with different hyperparameters $p$ and $q$. Figure 4.10(a) shows a visualization in which Node2vec reflects more homophily: the colors of nodes that are close to each other are similar. Figure 4.10(b) depicts the case where Node2vec captures more structural equivalence: nodes with similar structural characteristics share the same colors.

Figure 4.10 Visualization of Node2vec results with more emphasis on (a) homophily and (b) structural equivalence.
The concepts of network homophily and structural equivalence in Node2vec can be explained intuitively in recommender systems. Items that are “homophilous” with each other are likely to be products of the same category, with similar attributes, or frequently purchased together, while “structurally equivalent” items are those with similar trends or similar structural properties, such as the most popular items in each category, the best “always-buy-together” items in each category, and so on. There is no doubt that both are very important feature expressions in recommender systems. Thanks to Node2vec’s flexibility and its ability to explore different graph features, it is even possible to feed both the embedding trained with an emphasis on structural equivalence and the embedding trained with an emphasis on homophily into the subsequent deep learning network, so as to retain different graph feature information of the items.
4.4.3 EGES: A Comprehensive Graph Embedding Method from Alibaba
In 2018, Alibaba published the embedding method applied on Taobao.com – Enhanced Graph Embedding with Side Information (EGES) [Reference Wang9]. Its basic idea is to introduce supplementary (side) information on top of the graph embedding generated by DeepWalk.
Item embeddings can be generated simply from the item relational graph built from user behavior. But for a newly added item or a “long tail” item with little interaction data, the recommender system faces a serious cold start problem. In order to obtain “reasonable” initial embeddings for such cold start products, Alibaba enriches the sources of embedding information by introducing more side information, so that products without historical behavior records can still obtain reasonable initial embeddings.
The first step in generating graph embedding is to generate an item relationship graph. This graph can be generated through the sequence of user behaviors. It is also possible to use information such as same attribute and same category to establish edges between items to generate a content-based knowledge graph. The item vector generated based on the knowledge graph can be called the supplementary information embedding vector. Of course, there can be multiple supplementary information embedding vectors according to different types of supplementary information.
How does one fuse the multiple embedding vectors of an item into the item’s final embedding? The easiest way is to add an average pooling layer to the deep neural network and simply average the different embeddings. To prevent the loss of useful information caused by simple average pooling, the Alibaba team strengthened this step by assigning a weight to each embedding (similar to the attention mechanism of the DIN model), as shown in Figure 4.11. The EGES model assigns a different weight to the embedding vector corresponding to each type of feature. The hidden representation layer in Figure 4.11 is the layer that performs the weighted average over the different embeddings. The weighted average embedding vector is fed into the softmax layer, and the weight $a_j$ of each embedding is learned through gradient backpropagation.

Figure 4.11 EGES model.
In the actual model, Alibaba uses $e^{a_j}$ instead of $a_j$ as the weight of the corresponding embedding vector. We think there are two main reasons for this: one is to avoid zero-valued weights; the other is that $e^{a_j}$ has better mathematical properties in the gradient descent process.
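A numpy sketch of the weighted pooling step in EGES (variable names and sizes are assumed for illustration): each of an item’s embeddings is weighted by $e^{a_j}$, and the weights are normalized before averaging:

```python
import numpy as np

def eges_fusion(embeddings, a):
    """embeddings: (num_side_info + 1, d) matrix of an item's embeddings
    a: (num_side_info + 1,) learnable weight logits for this item"""
    w = np.exp(a)
    w = w / w.sum()                      # softmax-style normalization of e^{a_j}
    return w @ embeddings                # weighted average -> final item embedding

H = eges_fusion(np.random.rand(4, 64),   # e.g., item-ID, category, brand, shop embeddings
                np.array([0.5, 0.1, -0.2, 0.3]))
```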
EGES does not have overly complicated theoretical innovations, but it provides an engineering method of integrating multiple embeddings, which reduces the impact of the cold start problem caused by the lack of certain types of information. It is an embedding method that is suitable for practical adoption.
Graph embedding remains a hot topic in research and practice in both industry and academia. In addition to the mainstream methods such as DeepWalk, Node2vec, and EGES introduced in this section, LINE [Reference Tang10], SDNE [Reference Wang, Cui and Zhu11], and other methods are also important graph embedding models. Interested readers can learn about these further by reading the references.
4.5 Integration of Embedding and Deep Learning Recommender Systems
We have introduced the principles and development process of embedding. But in the real implementation of recommender systems, embedding needs to integrate with other parts of the deep learning network to complete the whole recommendation process. As an integral part of the deep learning recommender systems, embedding technology is mainly used in the following three ways:
(1) As an embedding layer in the deep learning network, it converts the input features from high-dimensional sparse vectors to low-dimensional dense feature vectors;
(2) As a pre-trained embedding feature vector, after connecting with other feature vectors, it can serve as an input to the deep learning network for training;
(3) By calculating the similarity between user embedding and item embedding, embedding can be directly used as one of the retrieval layers or retrieval strategies of the recommender system.
In this chapter, we will describe the detailed methods for combining embedding and deep learning recommender systems.
4.5.1 Embedding Layer in Deep Learning Networks
High-dimensional sparse feature vectors are naturally unsuitable for training complex multilayer neural networks because of the huge number of weights involved. Therefore, when a deep learning model is used to process high-dimensional sparse feature vectors, an embedding layer is almost always added between the input layer and the fully connected layers to map the high-dimensional sparse feature vectors to low-dimensional dense feature vectors. This technique is adopted in most of the recommendation models introduced in Chapter 3. The embedding layers of three typical deep learning models, Deep Crossing, NeuralCF, and Wide&Deep, are circled in red in Figure 4.12.

Figure 4.12 Embedding layers in the Deep Crossing, NeuralCF, and Wide&Deep models.
It can be clearly seen that the embedding layers of the three models use the one-hot encoded vectors of categorical features as input, and then output the low-dimensional embedding vectors. So structurally, the embedding layer in the deep neural network is a direct mapping of a high-dimensional vector to a low-dimensional vector (as shown in Figure 4.13).

Figure 4.13 Graphical and matrix representation of the embedding layer.
After the embedding layer is expressed in matrix form, the embedding mapping is essentially the process of learning an $m \times n$ weight matrix, where $m$ is the dimension of the input high-dimensional sparse vector and $n$ is the dimension of the output dense vector. If the input is a one-hot encoded feature vector, the weight vector corresponding to its nonzero dimension (a row of the matrix under this convention) is the embedding vector of that feature.
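A short PyTorch sketch (the dimensions are illustrative) showing that an embedding layer is exactly such an $m \times n$ weight matrix, with the one-hot multiplication replaced by an index lookup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

m, n = 10_000, 32                      # m: sparse input dimension, n: dense output dimension
embedding = nn.Embedding(m, n)         # internally stores an m x n weight matrix

feature_id = torch.tensor([42])
dense = embedding(feature_id)          # index lookup

one_hot = F.one_hot(feature_id, m).float()
assert torch.allclose(dense, one_hot @ embedding.weight)   # same result as one-hot x matrix
```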
It is theoretically optimal to integrate the embedding layer with the entire deep learning network for training, because the upper-layer gradient can be directly back-propagated to the input layer, and the model as a whole is self-consistent. However, the disadvantage of this structure is obvious. The dimension of the input vector of the embedding layer is often very large, resulting in a huge number of parameters of the entire embedding layer. Therefore, the addition of the embedding layer will slow down the convergence speed of the entire neural network. This has been discussed in Section 3.7.
4.5.2 Pre-Training Method for Embedding
In order to solve the problem of the huge training cost in the embedding layer, the training of embedding is often performed independently of the deep learning network. After the dense representation of the sparse features is obtained, it is then fed into the neural network together with other features for training the deep learning network.
A typical model using the embedding pre-training method is the FNN model introduced in Section 3.7. This uses each feature latent vector obtained by the FM model training as the initialization weight of the embedding layer, thereby accelerating the convergence speed of the entire network.
In the original implementation of the FNN model, the gradient descent process still updates the weights of the embedding layer. To further speed up convergence, one can also fix the weights of the embedding layer and update only the weights of the upper neural network, which makes training more efficient.
To extend this, the idea of embedding is to establish a mapping from high-dimensional vectors to low-dimensional vectors. The mapping method is not limited to neural networks, but can be any heterogeneous model. For example, in the GBDT+LR combination model introduced in Section 2.6, the GBDT part is essentially an embedding operation. The GBDT model is used to complete the embedding pre-training, and then the generated embedding vectors are input into the single-layer neural network (that is, logistic regression) for CTR prediction.
Since 2015, with the development of graph embedding technology, the expressivity of embedding itself has been further enhanced, and all kinds of supplementary information can be integrated into embedding. This makes embedding a very valuable feature of recommender systems. Usually, the training process of graph embedding can only be performed independently of the recommendation model, which makes the pre-training approach a more popular embedding training practice in the field of deep learning recommender systems.
It is true that separating the embedding training from the training of the deep neural network leads to some loss of information, but the independence of the training processes also brings greater training flexibility. For example, the embedding of an item or a user is relatively stable, since a user’s interests and an item’s attributes usually do not change dramatically within a few days. So the refresh frequency of the embedding model does not need to be high and can even be as low as weekly. However, in order to capture the latest overall trends in the data as soon as possible, the upper-layer neural network often requires more frequent training or even online learning. Using different training frequencies to update the embedding model and the neural network model is the optimal solution after trading off training overhead against model freshness.
4.5.3 Application of Embedding in the Retrieval Layer of Recommender Systems
The enhanced expressivity of embeddings makes it feasible to generate recommendation lists directly from them. Therefore, using embeddings in the retrieval layer to compute the similarity between representation vectors has gradually become popular in many recommender systems. Among these, the retrieval layer of the YouTube recommender system (shown in Figure 4.14) is a typical example of using embeddings to retrieve candidate items.

Figure 4.14 Structure diagram of the retrieval layer model in YouTube recommender systems.
Figure 4.14 illustrates the structure of the retrieval layer model in the YouTube recommender systems. The input layer features of the model are all user-related features. From left to right are the embedding vector of the user’s viewing history video, the embedding vector of the user’s search word, the embedding vector of the user’s geographic attribute feature, the age of the user (sample), and the gender-related features.
The output layer of the model is a softmax layer, so the model is essentially a multiclass classification model whose prediction objective is which video the user will watch. The input of the softmax layer is the user embedding generated by the three ReLU fully connected layers, and the output is the probability distribution of the user watching each video. Since each dimension of the output vector corresponds to one video, the column of the softmax layer’s weight matrix corresponding to that dimension can be used as the item embedding. Through offline training of the model, each user’s embedding and each item’s embedding can be obtained.
When the model is deployed in production, it is not necessary to deploy the entire deep neural network to run the prediction from raw feature vectors to the final output. It is enough to store the user and item embeddings in an online in-memory database. By taking the inner product of the user and item embeddings, we can compute the similarity between the user and each candidate item; after sorting by the similarity score, we obtain the Top N most relevant items from the candidate set for the subsequent steps. This is the process of using embeddings to retrieve relevant candidate items.
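A minimal numpy sketch of this brute-force retrieval step (array names and sizes are assumptions): compute the inner product between one user embedding and every item embedding, then keep the Top N:

```python
import numpy as np

d, num_items, top_n = 64, 100_000, 100
user_emb = np.random.rand(d)
item_embs = np.random.rand(num_items, d)        # all candidate item embeddings

scores = item_embs @ user_emb                   # one inner product per candidate item
top_items = np.argsort(-scores)[:top_n]         # indices of the Top N candidates
```

In practice, np.argpartition can replace the full sort, but the cost is still linear in the number of candidates, which motivates the indexing method of Section 4.6.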
This process takes $O(n)$ time to traverse all the items in the candidate set, where $n$ is the candidate pool size. In many web applications, the overall candidate set size $n$ can easily reach millions, so even an $O(n)$-level operation can consume a lot of computing time and cause high latency in the online inference process. Is there an indexing method that makes this candidate retrieval process faster? The answer is revealed in Section 4.6.
4.6 Locality Sensitive Hashing: A Fast Search Method for Embedding-Based Searching
As mentioned, one of the most important uses of the embedding technique is to retrieve similar items from the candidate set in recommender systems. The main function of the retrieval layer is to quickly reduce the candidate set from a large scale (for example, millions) to a smaller, manageable scale (for example, thousands or hundreds). This prevents all candidate items from being sent directly into the subsequent deep learning model, which would waste computing resources and cause high online inference latency.
Compared with traditional rule-based candidate retrieval methods, embedding technology is better suited to the retrieval problem of recommender systems because of its ability to synthesize a variety of information and features when predicting similarity. In practical applications, the key to productionizing embeddings is how to process hundreds of thousands or even millions of candidates quickly enough to satisfy the end-to-end latency constraints of the entire recommendation process.
4.6.1 Fast Nearest Neighbor Search with Embedding
The traditional way of calculating embedding similarity is to take the inner product of the user and item embedding vectors, which means that retrieving a user’s Top N relevant items requires traversing all the items in the candidate set. Assuming that the embedding space has $k$ dimensions and the candidate set contains $n$ items, the time complexity of traversing all the items is $O(kn)$. In a practical recommender system, the number of candidate items $n$ can easily reach the order of millions, so the time cost of this traversal is unbearable and leads to significant latency in online model inference.
Let’s think about this process from a different angle. Since the embedding of the user and the item is in the same vector space, the process of retrieving the most relevant item to the user using the embedding vector is actually a process of searching for the nearest neighbor in this vector space. If you can find a way to quickly search for the nearest neighbors in a high-dimensional space, then the embedding fast search problem can also be solved.
A commonly used fast nearest neighbor search method is to build a k-dimensional tree (k-d tree) index, which can reduce the time complexity of the search to $O(\log_2 n)$. However, the structure of the k-d tree is relatively complex, and the search often requires backtracking to make sure the returned results are truly the nearest, which complicates the search process. Moreover, a time complexity of $O(\log_2 n)$ is still not ideal. So is there a method with lower time complexity and simpler operations? Next, we introduce the state-of-the-art fast nearest neighbor search method for embedding spaces in practical recommender systems – Locality Sensitive Hashing (LSH) [Reference Slaney and Casey12].
4.6.2 Fundamentals of Locality Sensitive Hashing
The basic idea of LSH is to let adjacent points fall into the same “bucket,” so that a nearest neighbor search only needs to look in one bucket or a few adjacent buckets. Assuming the number of elements in each bucket is constant, the time complexity of the nearest neighbor search can be reduced to a constant level. Then, how are the “buckets” in LSH created? Next, we take nearest neighbor search based on Euclidean distance as an example to explain the process of constructing an LSH “bucket.”
First of all, let us clarify one question regarding the distance in different spaces. If a point in a high-dimensional space is mapped to a low-dimensional space, can its Euclidean relative distance be maintained? As shown in Figure 4.15, there are four colored dots in the middle of a two-dimensional space. When the dots in the two-dimensional space are mapped to three one-dimensional spaces a, b, and c through different angles, it can be seen that the close dots in the original two-dimensional space remain close in all one-dimensional spaces. However, the green dot and red dot that were originally far away become close in one-dimensional space a, but far away in space b.

Figure 4.15 Mapping the dots in high-dimensional space to low-dimensional space.
Thus, we can draw a qualitative conclusion: in Euclidean space, when points in a high-dimensional space are mapped to a low-dimensional space, points that were originally close must still be close in the low-dimensional space, while points that were originally far apart have a certain probability of becoming close.
After realizing that the low-dimensional space can retain the close-distance information from the high-dimensional space, we can construct the LSH “bucket” based on this knowledge.
For embedding vectors, the inner product operation can be used to build the LSH buckets. Suppose $\mathbf{v}$ is an embedding vector in a $k$-dimensional space and $\mathbf{x}$ is a randomly generated $k$-dimensional mapping vector. As shown in Eq. 4.7, the inner product maps $\mathbf{v}$ into a one-dimensional space as a scalar value:

$$h(\mathbf{v}) = \mathbf{v} \cdot \mathbf{x} \tag{4.7}$$

It can be seen from the conclusion above that the one-dimensional space can partially preserve the approximate distance information of the high-dimensional space. Therefore, bucketing can be performed by applying the hash function defined as follows,

$$h^{x, b}(\mathbf{v}) = \left\lfloor \frac{\mathbf{x} \cdot \mathbf{v} + b}{w} \right\rfloor$$

where $\lfloor \cdot \rfloor$ is the floor (round-down) operation, $w$ is the width of the bucket, and $b$ is a uniformly distributed random variable between 0 and $w$ used to avoid the solidification of the bucket boundaries.
The distance information is partially lost in this mapping and bucketing operation. If only one hash function is used for bucketing, some similar points will inevitably be missed. An effective solution is to use $m$ hash functions for bucketing simultaneously. If two points fall into the same bucket under all $m$ hash functions, the probability that the two points are indeed similar increases significantly. In this way, the approximate Top K nearest neighbors of a target point can be found by traversing only the limited number of candidates that fall into the same or adjacent buckets.
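A numpy sketch of this bucketing scheme (the bucket width, number of hash functions, and data are illustrative): each hash function is a random projection followed by the floor operation above, and only vectors sharing the same bucket signature are compared exactly:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, m, w = 64, 4, 4.0                                 # dimension, hash functions, bucket width
X = rng.normal(size=(m, d))                          # random projection vectors x
b = rng.uniform(0, w, size=m)                        # random offsets b in [0, w)

def lsh_signature(v):
    """Bucket indices of v under the m hash functions h(v) = floor((x . v + b) / w)."""
    return tuple(np.floor((X @ v + b) / w).astype(int))

items = rng.normal(size=(10_000, d))
buckets = defaultdict(list)
for idx, v in enumerate(items):
    buckets[lsh_signature(v)].append(idx)            # index all items into buckets

query = rng.normal(size=d)
candidates = buckets.get(lsh_signature(query), [])   # only these are compared exactly
```

Requiring the full signature to match corresponds to the AND strategy discussed in Section 4.6.3; accepting a match on any single hash function would correspond to the OR strategy.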
4.6.3 Multi-Bucket Strategy for Locality Sensitive Hashing
If multiple hash functions are used for bucketing, a question must be answered – should an AND operation or an OR operation be used to generate the final candidate set? If the candidate set is generated by an AND operation (for example, “point A and point B are in the same bucket of hash function 1” AND “point A and point B are in the same bucket of hash function 2”), the precision of the nearest neighbors in the candidate set is improved, and the smaller candidate set reduces the amount of traversal computation and thus the overall overhead; however, some true neighbors (such as points near a bucket boundary) may be missed. On the other hand, if an OR operation is used (for example, “point A and point B are in the same bucket of hash function 1” OR “point A and point B are in the same bucket of hash function 2”), the recall of the nearest neighbors in the candidate set is improved, but the candidate set grows and the computational cost rises. In the end, deciding how many hash functions to use and whether to combine them with an AND or an OR operation requires a trade-off between precision and recall.
This is the LSH method of inner product operation in Euclidean space. If cosine similarity is used as the distance standard, what method should be used for bucketing?
Cosine similarity measures the angle between two vectors: the smaller the angle, the “nearer” the neighbor. Therefore, the vector space can be divided into different hash buckets by fixed-interval hyperplanes, and, as before, the precision or recall of the LSH method can be tuned by choosing different sets of hyperplanes. Of course, there are many more ways to measure distance than Euclidean distance and cosine similarity, such as Manhattan distance, Chebyshev distance, and Hamming distance, and the corresponding LSH methods differ with the distance definition. However, the general idea of using LSH bucketing to retain partial distance information while shrinking the candidate set is common to all of them.
4.7 Summary: Core Operations of Deep Learning Recommender Systems
This chapter introduces the core operation of deep learning – embedding technology. From the original Word2vec, to Item2vec, and then to graph embedding which integrates more structural information and supplementary information, the application of embedding in recommender systems is getting deeper and deeper, and the applied methods are becoming more and more diverse. After LSH is applied to the search for similar candidate items, embedding technology is becoming more mature in terms of theory and engineering practice. Table 4.1 summarizes the basic mechanisms, characteristics, and limitations of the embedding methods mentioned in this chapter.
Table 4.1 Key points of the embedding methods and related techniques
Embedding Method | Mechanisms | Characteristics | Limitations |
---|---|---|---|
Word2vec | Uses the correlation of words within sentences to build the model; a single-hidden-layer neural network yields the word embedding vectors | The classic embedding method | Can only be trained on word sequence samples |
Item2vec | Extends the idea of Word2vec to any sequence data | Brings Word2vec into the recommendation domain | Can only be trained on sequence samples |
DeepWalk | Performs random walks on the graph structure to generate sequence samples, then applies the idea of Word2vec | Easy-to-use graph embedding method | Not objective oriented |
Node2vec | On the basis of DeepWalk, balances homophily and structural equivalence by adjusting the random walk weights | Controllable tuning to emphasize different graph patterns | Needs more hyperparameter tuning |
EGES | Generates the final embedding by integrating the embeddings from different information domains | Integrates a variety of supplementary information to mitigate the cold start problem of embeddings | No major theoretical innovation; mainly an engineering solution to multi-embedding integration |
LSH | Fast nearest neighbor search for embedding vectors based on locality sensitive hashing | Solves the problem of fast candidate retrieval with embeddings | Has a small probability of missing the nearest neighbors and requires hyperparameter tuning |
In Chapters 2 and 3, we introduced the evolution of state-of-the-art recommendation models. In this chapter, we have focused on the embedding technology closely related to deep learning recommendation models. This completes the introduction to recommendation models in this book.
The recommendation model is the engine that drives the personalization of recommender systems, and it is the area in which recommendation teams invest the most effort. You will have sensed the rapid development and evolution of recommendation models in academia and industry from the previous chapters. It should be noted, however, that a mature recommender system must consider not only the recommendation model but also retrieval strategies, cold start, exploration and exploitation, model evaluation, online serving, and many other issues.
In the following chapters, we view recommender systems from these different perspectives and introduce the cutting-edge technologies of the different modules of a recommender system. These contents complement the recommendation models and together constitute the main framework of deep learning recommender systems.