ztqakita's Blog
    • Posts
    • Introduction
    • Algorithms
      • Complexity & Divide and Conquer
      • Dynamic Programming
      • Greedy & Back-track & Branch and Bound
    • Compiler
      • Lexical Analysis & Parsing
      • Semantic Analysis & Runtime Environment
      • Syntax-directed Translation
    • Computational Neuroscience
      • Ionic Currents
      • Neuroscience Basic Knowledge
    • Database System
      • Database System Lecture Note 1
      • Database System Lecture Note 2
      • Database System Lecture Note 3
      • Database System Lecture Note 4
    • DL
      • Convolutional Neural Network
      • Introduction to Deep Learning
      • Optimization for Deep Learning
      • Recursive Neural Network
      • Self-attention
      • Transformer
    • Life Learning
      • Architectures of neuronal circuits
      • how to model
      • Lecture James McClelland
      • Lecture Yao Xin
    • ML
      • Basic Concepts
      • Classification
      • Decision Tree
      • KNN
      • Perceptron
      • SOM
      • Support Vector Machines
    • Operating System
      • CPU Scheduling
      • File System
      • Introduction & OS Structure
      • Mass-Storage Structure & I/O System
      • Memory Management
      • Process & Threads
      • Process Synchronization
    • Paper Reading
      • Continuous-attractor Neural Network
      • Few-Shot Class-Incremental Learning
      • Integrated understanding system
      • Push-pull feedback
      • reservoir decision making network
      • Task representations in neural networks
    Self-attention

    I. Self-attention Overview Input: a sequence of vectors (the length is not fixed). Output: each vector gets a label (POS tagging), the whole sequence gets one label (sentiment analysis), or the model decides the number of labels itself (seq2seq). Self-attention can handle global information, while FC layers handle local information. Self-attention is the key module of the Transformer, which is covered in other articles. II. How does it work? First, we need to figure out the relevance between each pair of vectors.
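    Below is a minimal numpy sketch of the relevance computation that dot-product self-attention performs; the projection matrices `Wq`, `Wk`, `Wv` and the toy sequence `X` are illustrative assumptions, not code from the post.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # pairwise relevance of vector i to vector j
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)    # softmax over each row
    return alpha @ V                              # each output is a weighted sum of all values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # 4 input vectors of dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)        # (4, 8): one output vector per input vector
```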

    June 11, 2021
    Transformer

    I. Transformer Overview A seq2seq model. Input: a sequence of vectors. Output: a sequence (the length is not fixed). II. Encoder It is essentially a self-attention model! For a single block, the structure can be understood as follows: unlike plain self-attention, a residual connection is used, adding the intermediate result $a$ produced by self-attention back to its input $b$; the sum passes through layer normalization to give a new output. That new output is fed into the FC layer, another residual is added, and a second layer normalization produces the result of the block. Once one block is understood, the whole Encoder is simply a network built from n such blocks. First, the words are converted into word embeddings, and positional encoding information is added to them. After every multi-head attention or feed-forward sub-layer comes residual + layer normalization, and this design is the key innovation of the Transformer Encoder. III. Decoder: Autoregressive (AT) The model structure of the Decoder is shown as follows. Going through the figure step by step, layer by layer: Masked Multi-head Self-attention: when producing each vector $b^i$, the model can no longer look at later positions and may only relate to the earlier inputs. In detail, to obtain $b^2$ we only need to take dot products with $k^1$ and $k^2$.
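    As a rough illustration of the block structure described above (self-attention, residual add, layer normalization, FC, residual, layer normalization again), here is a numpy sketch; `attn` and `ffn` are placeholder callables standing in for the real multi-head attention and feed-forward sub-layers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector to zero mean and unit variance across features."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(x, attn, ffn):
    """One encoder block: each sub-layer is wrapped in residual add + layer norm."""
    a = attn(x)                # self-attention result a
    h = layer_norm(x + a)      # residual add, then layer normalization
    f = ffn(h)                 # position-wise feed-forward network
    return layer_norm(h + f)   # second residual add + layer normalization
```

    The full Encoder would then stack n such blocks on top of the embedded, positionally encoded input.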

    June 11, 2021
    Recursive Neural Network

    I. RNN Structure Overview The input is a sequence of vectors. Note: changing the order of the input sequence changes the output. The same network is used at every time step; connections drawn in the same color share the same weights. When the values stored in the memory are different, the output will also be different. II. Types of RNN An Elman network's memory stores the values of the hidden-layer output, while a Jordan network's memory stores the values of the output.
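    A small numpy sketch of the two memory schemes may help; the weight matrices here are hypothetical placeholders and bias terms are omitted.

```python
import numpy as np

def elman_step(x, memory, Wx, Wm, Wy):
    """Elman network: the memory stores the previous hidden-layer output."""
    h = np.tanh(x @ Wx + memory @ Wm)
    y = h @ Wy
    return y, h        # the new hidden output h is written back into memory

def jordan_step(x, memory, Wx, Wm, Wy):
    """Jordan network: the memory stores the previous network output."""
    h = np.tanh(x @ Wx + memory @ Wm)
    y = h @ Wy
    return y, y        # the new output y is written back into memory
```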

    October 15, 2020
    Convolutional Neural Network

    I. CNN Structure Overview II. Convolution Note: 1. Every element in a filter is a network parameter to be learned. 2. The stride is how far the filter moves from its previous position at each step. 3. The filter size is chosen by the programmer. From the picture we can see that the largest values in a feature map indicate where the filter's feature is present. The same process is repeated for every filter, generating more feature maps. For color images, a filter cube is used instead of a matrix.
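    To make the filter/stride/feature-map mechanics concrete, here is a plain numpy sketch of a single-channel convolution; the toy image and filter values are made up for illustration.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2-D convolution: slide the filter over the image with the given
    stride; each dot product becomes one entry of the feature map."""
    H, W = image.shape
    k, _ = kernel.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            fmap[i, j] = np.sum(patch * kernel)   # large value -> the filter's feature is present here
    return fmap

image = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)                     # in a CNN these filter values are learned
print(conv2d(image, kernel, stride=1).shape)      # (4, 4)
```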

    September 14, 2020
    Introduction to Deep Learning

    I. Basic Concepts 1. Fully Connected Feedforward Network 2. Matrix Operation Every layer has a weight matrix and a bias vector; using matrix operations we can compute the output $y$. Tip: using a GPU can speed up matrix operations. II. Why Deep Learning? 1. Modularization For a neural network, more neurons is not always better: the examples show that adding layers (going deeper) improves accuracy more effectively, and this is the idea of modularization. For example, when training the model below, you can use basic classifiers as modules, and each basic classifier can then have sufficient training examples.
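    As a quick sketch of the per-layer matrix operation, the snippet below computes $y$ layer by layer with a sigmoid activation; the layer sizes and activation choice are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Fully connected feedforward pass: each layer computes sigma(W @ a + b)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)       # one matrix multiply + bias + activation per layer
    return a

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 8)), rng.normal(size=(4, 16))]   # two layers: 8 -> 16 -> 4
biases  = [rng.normal(size=16), rng.normal(size=4)]
print(forward(rng.normal(size=8), weights, biases).shape)        # (4,)
```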

    August 10, 2020
    Optimization for Deep Learning

    Some notation: $\theta_t$: the model parameters at time step $t$. $\nabla L(\theta_t)$ or $g_t$: the gradient at $\theta_t$, used to compute $\theta_{t+1}$. $m_{t+1}$: the momentum accumulated from time step 0 to time step $t$, also used to compute $\theta_{t+1}$. I. Adaptive Learning Rates In gradient descent we need to set the learning rate so that training converges properly and finds a local minimum, but sometimes it is difficult to choose a proper value for the learning rate.
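    Using the notation above, the following is a hedged sketch of two common update rules (classical momentum and an Adagrad-style adaptive learning rate); the exact formulas discussed in the post may differ.

```python
import numpy as np

def momentum_step(theta, g, m, lr=0.01, beta=0.9):
    """Momentum: m_{t+1} = beta * m_t - lr * g_t; theta_{t+1} = theta_t + m_{t+1}."""
    m_next = beta * m - lr * g
    return theta + m_next, m_next

def adagrad_step(theta, g, accum, lr=0.01, eps=1e-8):
    """Adagrad-style adaptive learning rate: divide the step by the root of the
    accumulated squared gradients, shrinking the effective rate for parameters
    that have already seen large gradients."""
    accum = accum + g ** 2
    theta_next = theta - lr * g / (np.sqrt(accum) + eps)
    return theta_next, accum
```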

    July 25, 2020