1. Notes for denoising diffusion models
Denoising diffusion models were introduced by Sohl-Dickstein et al. (2015), and early related work based on score matching was carried out by Song & Ermon (2019). Ho et al. (2020) produced image samples that were competitive with GANs and kick-started a wave of interest in this area. Most of the exposition in this chapter, including the original formulation and the reparameterization, is derived from this paper. Dhariwal & Nichol (2021) improved on the quality of these results and showed for the first time that images from diffusion models were quantitatively superior to those from GANs in terms of Fréchet Inception Distance. At the time of writing, the state-of-the-art results for conditional image synthesis have been achieved by Karras et al. (2022). Surveys of denoising diffusion models can be found in Croitoru et al. (2022), Cao et al. (2022), Luo (2022), and Yang et al. (2022).
Applications for images: Applications of diffusion models include text-to-image generation (Nichol et al., 2022; Ramesh et al., 2022; Saharia et al., 2022b), image-to-image tasks such as colorization, inpainting, uncropping, and restoration (Saharia et al., 2022a), super-resolution (Saharia et al., 2022c), image editing (Hertz et al., 2022; Meng et al., 2021), removing adversarial perturbations (Nie et al., 2022), semantic segmentation (Baranchuk et al., 2022), and medical imaging (Song et al., 2021b; Chung & Ye, 2022; Chung et al., 2022; Peng et al., 2022; Xie & Li, 2022; Luo et al., 2022), where the diffusion model is sometimes used as a prior.
Different data types: Diffusion models have also been applied to video data (Ho et al., 2022b; Harvey et al., 2022; Yang et al., 2022; Höppe et al., 2022; Voleti et al., 2022) for generation, past and future frame prediction, and interpolation. They have been used for 3D shape generation (Zhou et al., 2021; Luo & Hu, 2021), and recently a technique has been introduced to generate 3D models using only a 2D text-to-image diffusion model (Poole et al., 2023). Austin et al. (2021) and Hoogeboom et al. (2021) investigated diffusion models for discrete data. Kong et al. (2021) and Chen et al. (2021b) applied diffusion models to audio data.
Alternatives to denoising: The diffusion models in this chapter mix noise with the data and build a model to gradually denoise the result. However, degrading the image using noise is not necessary. Rissanen et al. (2022) devised a method that progressively blurs the image, and Bansal et al. (2022) showed that the same ideas work with a large family of degradations that do not have to be stochastic. These include masking, morphing, blurring, and pixelating.
Comparison to other generative models: Diffusion models synthesize higher-quality images than other generative models and are simple to train. They can be thought of as a special case of a hierarchical VAE (Vahdat & Kautz, 2020; Sønderby et al., 2016b) in which the encoder is fixed and the latent space is the same size as the data. They are probabilistic, but, like the VAE, can only compute a lower bound on the likelihood of a data point. However, Kingma et al. (2021) show that this lower bound improves on the exact log-likelihoods for test data from normalizing flows and autoregressive models. The main disadvantages of diffusion models are that they are slow and that the latent space has no semantic interpretation.
Improving quality: Many techniques have been proposed to improve image quality. These include the reparameterization of the network described in section 18.5 and the equal weighting of the subsequent terms (Ho et al., 2020). Choi et al. (2022) subsequently investigated different weightings of the terms in the loss function.
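As a minimal sketch (in Python/NumPy, with illustrative names), the equal-weighted noise-prediction objective of Ho et al. (2020) can be written as follows, assuming a denoising network model(x_t, t) that predicts the added noise and a precomputed schedule alpha_bar[t] (the cumulative product of 1 − β up to step t):

```python
import numpy as np

def diffusion_loss(model, x0, alpha_bar, rng):
    """Equal-weighted noise-prediction loss (a sketch).

    model(x_t, t) is assumed to predict the noise that was added;
    alpha_bar[t] is the cumulative product of 1 - beta up to step t.
    """
    t = rng.integers(len(alpha_bar))       # random timestep; all steps weighted equally
    eps = rng.standard_normal(x0.shape)    # the noise the network must predict
    # The reparameterization: diffuse x0 directly to step t in closed form
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((model(x_t, t) - eps) ** 2)
```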
Ho et al. (2022a) developed the cascaded method for producing very high-resolution images (figure 18.11). To prevent artifacts in lower-resolution images from being propagated to the higher resolutions, they introduced noise conditioning augmentation; here, the lower-resolution conditioning image is degraded by adding noise at each training step. This reduces the reliance on the exact details of the lower-resolution image during training. Noise conditioning augmentation is also applied at inference time, where the best noise level is chosen by sweeping over different values.
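A minimal sketch of noise conditioning augmentation, reusing the alpha_bar schedule above (the function and argument names are illustrative):

```python
import numpy as np

def noise_condition(x_low, t_aug, alpha_bar, rng):
    """Degrade the low-resolution conditioning image with noise (a sketch).

    During training, t_aug is sampled randomly; at inference, it is swept
    over several values and the best-performing level is kept.
    """
    eps = rng.standard_normal(x_low.shape)
    return np.sqrt(alpha_bar[t_aug]) * x_low + np.sqrt(1.0 - alpha_bar[t_aug]) * eps
```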
Improving speed: One of the major drawbacks of diffusion models is that they take a long time to train and sample from. Stable diffusion (Rombach et al., 2022) projects the original data to a smaller latent space using a conventional VAE and then runs the diffusion process in this smaller space. This has the advantages of reducing the dimensionality of the training data for the diffusion process and allowing other data types (text, graphs, etc.) to be described by diffusion models. Vahdat et al. (2021) applied a similar approach. Song et al. (2021a) showed that an entire family of diffusion processes is compatible with the training objective. Most of these processes are non-Markovian (i.e., the diffusion step does not depend only on the results of the previous step). One of these models is the denoising diffusion implicit model (DDIM), in which the updates are not stochastic (figure 18.10b). This model is amenable to taking larger steps without inducing large errors. It effectively converts the model into an ordinary differential equation (ODE) in which the trajectories have low curvature, and allows efficient numerical methods for solving ODEs to be applied.
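As a minimal sketch (assuming the same noise-prediction network and alpha_bar schedule as above), one deterministic DDIM update looks like this; because no noise is injected, t_prev can be many steps earlier than t:

```python
import numpy as np

def ddim_step(model, x_t, t, t_prev, alpha_bar):
    """One deterministic DDIM update (a sketch; model predicts the noise)."""
    eps = model(x_t, t)
    # Clean image implied by the current noise estimate
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    # Jump directly to step t_prev; no fresh noise is added
    return np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps
```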
Conditional generation: Dhariwal & Nichol (2021) introduced classifier guidance, in which a classifier learns to identify the category of object being synthesized at each step, and this is used to bias the denoising update toward that class. This works well, but training a separate classifier is expensive. Classifier-free guidance (Ho & Salimans, 2022) concurrently trains conditional and unconditional denoising models by dropping the class information some proportion of the time, in a process akin to dropout. This technique allows control of the relative contributions of the conditional and unconditional components. By over-weighting the conditional component, the model produces more typical and realistic samples.
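As a minimal sketch, assuming a conditional noise-prediction network model(x_t, t, c) that accepts c=None for the unconditional case, classifier-free guidance combines the two noise estimates as:

```python
def guided_eps(model, x_t, t, c, w):
    """Classifier-free guidance (a sketch).

    One network is queried with and without the conditioning c; the
    guidance scale w > 0 over-weights the conditional estimate, trading
    diversity for more typical, realistic samples.
    """
    return (1 + w) * model(x_t, t, c) - w * model(x_t, t, None)
```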
Text-to-image: Before diffusion models, state-of-the-art text-to-image systems were based on transformers (Ramesh et al., 2021). GLIDE (Nichol et al., 2022) and DALL·E 2 (Ramesh et al., 2022) both conditioned on embeddings from the CLIP model (Radford et al., 2021), which generates joint embeddings for text and image data. Imagen (Saharia et al., 2022b) showed that text embeddings from a large language model could produce even better results (see figure 18.13). The same authors introduced a benchmark (DrawBench) designed to evaluate the ability of a model to render colors, numbers of objects, spatial relations, and other characteristics. Feng et al. (2022) developed a Chinese text-to-image model.
Connections to other models: This chapter described diffusion models as hierarchical variational autoencoders because this approach connects most closely with the other parts of this book. However, diffusion models also have close connections with stochastic differential equations (consider the paths in figure 18.5) and with score matching (Song & Ermon, 2019, 2020). Song et al. (2021c) presented a framework based on stochastic differential equations that encompasses both the denoising and score matching interpretations. Diffusion models also have close connections to normalizing flows (Zhang & Chen, 2021). Yang et al. (2022) present an overview of the relation between diffusion models and other generative approaches.
2. Notes for graph neural networks
Applications: Applications include graph classification (e.g., Zhang et al., 2018b), node classification (e.g., Kipf & Welling, 2017), edge prediction (e.g., Zhang & Chen, 2018), graph clustering (e.g., Tsitsulin et al., 2020), and recommender systems (e.g., Wu et al., 2023). Node classification methods are reviewed in Xiao et al. (2022a), graph classification methods in Errica et al. (2019), and edge prediction methods in Mutlu et al. (2020) and Kumar et al. (2020).
Spectral methods: Bruna et al. (2013) applied the convolution operation in the Fourier domain. The Fourier basis vectors can be found by taking the eigendecomposition of the graph Laplacian matrix, L = D − A, where D is the degree matrix and A is the adjacency matrix. This has the disadvantages that the filters are not localized and the decomposition is prohibitively expensive for large graphs. Henaff et al. (2015) tackled the first problem by forcing the Fourier representation to be smooth (and hence the spatial domain to be localized). Defferrard et al. (2016) introduced ChebNet, which approximates the filters efficiently by making use of the recursive properties of Chebyshev polynomials. This both provides spatially localized filters and reduces the computation. Kipf & Welling (2017) simplified this further to construct filters that use only a 1-hop neighborhood, resulting in a formulation that is similar to the spatial methods described in this chapter and providing a bridge between spectral and spatial methods.
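A minimal NumPy sketch of this spectral view, with an illustrative learned filter theta holding one response per graph frequency; the full eigendecomposition here is exactly the cost that ChebNet and its successors avoid:

```python
import numpy as np

def spectral_conv(A, X, theta):
    """Graph convolution in the Fourier domain (a sketch, after Bruna et al.).

    A: adjacency matrix, X: node features, theta: filter response per
    graph frequency (one value per eigenvector of the Laplacian).
    """
    L = np.diag(A.sum(axis=1)) - A            # graph Laplacian L = D - A
    _, U = np.linalg.eigh(L)                  # Fourier basis = eigenvectors of L
    return U @ (theta[:, None] * (U.T @ X))   # transform, filter, transform back
```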
Spatial methods: Spectral methods are ultimately based on the Graph Laplacian, and so if the graph changes, the model must be retrained. This problem spurred the development of spatial methods. Duvenaud et al. (2015) defined convolutions in the spatial domain, using a different weight matrix to combine the adjacent embeddings for each node degree. This has the disadvantage that it becomes impractical if some nodes have a very large number of connections. Diffusion convolutional neural networks (Atwood & Towsley, 2016) use powers of the normalized adjacency matrix to blend features across different scales, sum these, pointwise multiply by weights, and pass through an activation function to create the node embeddings. Gilmer et al. (2017) introduced message passing neural networks, which defined convolutions on the graph as propagating messages from spatial neighbors. The “aggregate and combine” formulation of GraphSAGE (Hamilton et al., 2017a) fits into this framework.
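The following is a generic sketch of this message-passing / aggregate-and-combine pattern; the sum aggregator and the two weight matrices are illustrative choices rather than any one published variant:

```python
import numpy as np

def mp_layer(A, H, W_self, W_neigh):
    """One generic 'aggregate and combine' update (a sketch).

    Aggregate: sum messages from spatial neighbors via the adjacency matrix.
    Combine: merge them with the node's own embedding, then apply a ReLU.
    """
    messages = A @ H                                          # aggregate
    return np.maximum(0.0, H @ W_self + messages @ W_neigh)   # combine
```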
Aggregate and combine: Graph convolutional networks (Kipf & Welling, 2017) take a weighted average of the neighbors and the current node and then apply a linear mapping and a ReLU. GraphSAGE (Hamilton et al., 2017a) applies a neural network layer to each neighbor and then takes the elementwise maximum to aggregate. Chiang et al. (2019) propose diagonal enhancement, in which the previous embedding is weighted more than the neighbors. Kipf & Welling (2017) introduced Kipf normalization, which normalizes the sum of the neighboring embeddings based on the degrees of the current node and its neighbors (see equation 13.19). The mixture model network or MoNet (Monti et al., 2017) takes this one step further by learning a weighting based on these two quantities. They associate a pseudo-coordinate system with each node, where the positions of the neighbors depend on the degrees of the current node and the neighbor. They then learn a continuous function based on a mixture of Gaussians and sample this at the pseudo-coordinates of the neighbors to get the weights. In this way, they can learn the weightings for nodes and neighbors with arbitrary degrees. Pham et al. (2017) use a linear interpolation of the node embedding and the neighbors, with a different weighted combination for each dimension; the weight of this gating mechanism is generated as a function of the data.
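As a minimal sketch of the GCN update with Kipf normalization (self-loops and a ReLU included; names are illustrative), each summed neighbor embedding is scaled by 1/sqrt(d_i d_j), where d_i and d_j are the degrees of the current node and the neighbor:

```python
import numpy as np

def gcn_layer(A, H, W):
    """GCN layer with Kipf normalization (a sketch of equation 13.19)."""
    A_hat = A + np.eye(A.shape[0])                       # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_norm @ H @ W)               # aggregate, map, ReLU
```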
Residual connections: Kipf & Welling (2017) proposed a residual connection in which the original embeddings are added to the updated ones. Hamilton et al. (2017b) concatenate the previous embedding to the output of the next layer (see equation 13.16). Rossi et al. (2020) present an inception-style network, where the node embedding is concatenated not only to the aggregation of its neighbors but also to the aggregation of all neighbors within a walk of two (computed via powers of the adjacency matrix). Xu et al. (2018) introduced jump knowledge connections, in which the final output at each node consists of the concatenated node embeddings from throughout the network. Zhang & Meng (2019) present a general formulation of residual embeddings called GResNet and investigate several variations, in which the embeddings from the previous layer are added, the input embeddings are added, or versions of these that aggregate information from their neighbors (without further transformation) are added.
Attention in graph neural networks: Veličković et al. (2019) developed the graph attention network (figure 13.12c). Their formulation uses multiple heads whose outputs are combined together symmetrically. Gated Attention Networks (Zhang et al., 2018a) weight the output of the different heads in a way that depends on the data itself. Graph-BERT (Zhang et al., 2020) performs node classification using self-attention alone; the structure of the graph is captured by adding position embeddings to the data, in a similar way to how the absolute or relative position of words is captured in the transformer (chapter 12). For example, they add positional information that depends on the number of hops between nodes in the graph.
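A single-head sketch of the graph attention update, with an illustrative attention vector a of length 2d and the customary LeakyReLU slope of 0.2; attention is restricted to each node's neighbors plus itself:

```python
import numpy as np

def gat_layer(A, H, W, a):
    """Single-head graph attention (a sketch).

    Attention logit for edge (i, j) is LeakyReLU(a^T [z_i || z_j]),
    normalized with a softmax over each node's neighborhood.
    """
    Z = H @ W                                                # transform embeddings
    d = Z.shape[1]
    logits = (Z @ a[:d])[:, None] + (Z @ a[d:])[None, :]     # a^T [z_i || z_j]
    logits = np.maximum(0.2 * logits, logits)                # LeakyReLU
    mask = (A + np.eye(A.shape[0])) > 0                      # neighbors + self
    logits = np.where(mask, logits, -np.inf)                 # exclude non-edges
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)                 # softmax per node
    return alpha @ Z
```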
Permutation invariance: In DeepSets, Zaheer et al. (2017) presented a general permutation-invariant operator for processing sets. Janossy pooling (Murphy et al., 2018) accepts that many functions are not permutation invariant, and instead uses a permutation-sensitive function and averages the results across many permutations.
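Minimal sketches of both ideas, where phi, rho, and f are arbitrary callables supplied by the user:

```python
import numpy as np

def deep_sets(X, phi, rho):
    """DeepSets permutation-invariant function rho(sum_i phi(x_i)) (a sketch)."""
    return rho(sum(phi(x) for x in X))

def janossy_pool(X, f, rng, n_perm=8):
    """Janossy pooling (a sketch): average a permutation-sensitive function f
    over randomly sampled orderings of the set X."""
    samples = [f([X[i] for i in rng.permutation(len(X))]) for _ in range(n_perm)]
    return sum(samples) / n_perm

# Example: an order-independent statistic of a set of scalars
rng = np.random.default_rng(0)
print(deep_sets([1.0, 2.0, 3.0], phi=lambda x: x ** 2, rho=np.sqrt))
```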
Edge graphs: The notion of the edge graph, line graph, or adjoint graph dates to Whitney (1932). The idea of "weaving" layers that update node embeddings from node embeddings, node embeddings from edge embeddings, edge embeddings from edge embeddings, and edge embeddings from node embeddings was proposed by Kearnes et al. (2016), although here the node-node and edge-edge updates do not involve the neighbors. Monti et al. (2018) introduced the dual-primal graph CNN, a modern formulation in a CNN framework that alternates between updating on the original graph and the edge graph.
Multi-relational graphs: Schlichtkrull et al. (2018) proposed a variation of graph convolutional networks for multi-relational graphs (i.e., graphs with more than one edge type). Their scheme separately aggregates information from each edge type, using different parameters. If there are many edge types, then the number of parameters may become large, and to combat this they propose that each edge type uses a different weighting of a basis set of parameters.
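A sketch of this scheme with the basis-decomposition trick (names illustrative): each edge type r gets its own weight matrix, but that matrix is a learned mixture W_r = sum_b c_rb B_b of a small shared basis set, which caps the parameter count when there are many edge types:

```python
import numpy as np

def rgcn_layer(A_by_type, H, coeff, basis):
    """Relational GCN update with basis decomposition (a sketch).

    A_by_type: one adjacency matrix per edge type; coeff[r] holds the
    mixture weights for type r over the shared basis matrices.
    """
    out = np.zeros((H.shape[0], basis[0].shape[1]))
    for r, A in enumerate(A_by_type):
        W_r = sum(c * B for c, B in zip(coeff[r], basis))   # W_r = sum_b c_rb B_b
        out += A @ H @ W_r                                  # aggregate per edge type
    return np.maximum(0.0, out)
```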
Hierarchical representations and pooling: CNNs for image classification gradually decrease the size of the representation but increase the number of channels as the network progresses. However, the GCNs for graph classification in this chapter maintain the entire graph until the last layer and then combine all of the nodes to compute the final prediction. Ying et al. (2018b) propose DiffPool, which clusters graph nodes to make a graph that gets progressively smaller as the depth increases, in a way that is differentiable and so can be learned. This can be done based on the graph structure alone or adaptively based on the graph structure and the embeddings. Other pooling methods include SortPool (Zhang et al., 2018b) and self-attention graph pooling (Lee et al., 2019). A comparison of pooling layers for graph neural networks can be found in Grattarola et al. (2022). Gao & Ji (2019) propose an encoder-decoder structure for graphs based on the U-Net (see figure 11.10).
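A sketch of one DiffPool coarsening step, assuming a soft assignment matrix S (n × m, rows summing to one) has already been produced by a learned, differentiable module:

```python
def diffpool(A, X, S):
    """One DiffPool coarsening step (a sketch).

    A (n x n), X (n x d), and S (n x m) are NumPy arrays; S softly assigns
    the n nodes to m < n clusters, so the graph shrinks with depth.
    """
    A_pooled = S.T @ A @ S   # adjacency between clusters (m x m)
    X_pooled = S.T @ X       # pooled cluster features (m x d)
    return A_pooled, X_pooled
```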
Geometric graphs: MoNet (Monti et al., 2017) can easily be adapted to exploit geometric information because neighboring nodes have well-defined spatial positions. They learn a mixture-of-Gaussians function and sample this at the relative coordinates of the neighbors. In this way, they can weight neighboring nodes based on their relative positions, as in standard convolutional neural networks, even though these positions are not consistent. The geodesic CNN (Masci et al., 2015) and anisotropic CNN (Boscaini et al., 2016) both adapt convolution to manifolds (i.e., surfaces) represented by triangular meshes. They locally approximate the surface as a plane and define a coordinate system on this plane around the current node.