Lecture 1: Introduction to Deep Learning
The most common form of machine learning, deep or not, is supervised learning.
Supervised Learning
Supervised learning is a type of machine learning where the algorithm learns a mapping from input data to output labels based on labeled training data. In supervised learning, the algorithm is provided with a dataset consisting of input-output pairs, where each input is associated with a corresponding output label. The goal is for the algorithm to learn a mapping or relationship between the inputs and outputs so that it can accurately predict the output labels for new, unseen inputs.
Key characteristics of supervised learning include:
- Training with Labeled Data: The training data provided to the algorithm is labeled, meaning that each input is paired with the correct output label. This allows the algorithm to learn from the examples provided.
- Prediction of Outputs: Once trained, the algorithm can predict output labels for new input data that it has not seen before. These predictions are based on the learned mapping from the training data.
- Types of Supervised Learning:
- Classification: In classification tasks, the output labels are discrete categories or classes. The algorithm learns to classify inputs into one of these predefined classes. Example: predicting whether an email is spam or not spam.
- Regression: In regression tasks, the output labels are continuous numerical values. The algorithm learns to predict a value based on input features. Example: predicting house prices based on features like size, number of bedrooms, etc.
- Evaluation: The performance of a supervised learning algorithm is typically evaluated using metrics such as accuracy (for classification tasks) or mean squared error (for regression tasks) on a separate validation or test dataset.
Supervised learning is widely used in various domains, including image recognition, natural language processing, medical diagnosis, and financial forecasting. It forms the basis for many practical applications of machine learning in real-world scenarios.
Forward Propagation and Backward Propagation
Forward propagation and backward propagation are two essential steps in training neural networks in deep learning.
- Forward Propagation:
- During forward propagation, input data passes through the neural network, undergoing a series of operations such as weighted summation and activation functions at each layer, ultimately producing the network's output.
- The purpose of forward propagation is to compute the network's predictions or outputs, which are then compared with the true labels to calculate the loss function.
- Backward Propagation:
- In backward propagation, the gradient of the loss function is computed with respect to the parameters of the network, and this gradient is propagated backward through the network, from the output layer to the input layer.
- The goal of backward propagation is to compute the gradients of the loss function with respect to the network parameters, enabling the optimization algorithm (e.g., gradient descent) to update the parameters in a way that minimizes the loss function.
In summary, forward propagation computes the network's outputs, while backward propagation computes the gradients of the loss function with respect to the network parameters. Together they form the training loop of a neural network, allowing it to learn from data and optimize its parameters for tasks such as classification and regression.
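As a concrete illustration, here is a minimal NumPy sketch of one training step (forward pass, backward pass, parameter update) for a tiny one-hidden-layer regression network; the sizes, data, and variable names are illustrative assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))        # 4 samples, 3 input features
y = rng.normal(size=(4, 1))        # regression targets

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # input -> hidden
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # hidden -> output

# Forward propagation: weighted sums and an activation at each layer.
h = np.tanh(X @ W1 + b1)
y_hat = h @ W2 + b2
loss = np.mean((y_hat - y) ** 2)   # mean squared error

# Backward propagation: apply the chain rule from the output back to the input.
d_yhat = 2 * (y_hat - y) / len(y)  # dL/dy_hat
dW2 = h.T @ d_yhat
db2 = d_yhat.sum(axis=0)
dh = d_yhat @ W2.T
dz1 = dh * (1 - h ** 2)            # tanh'(z) = 1 - tanh(z)^2
dW1 = X.T @ dz1
db1 = dz1.sum(axis=0)

# Gradient-descent update of the parameters.
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```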
Convolutional Kernel
A convolutional kernel is a small matrix of weights used in convolutional neural networks (CNNs) to extract features from input data. It slides over the input data, computing dot products at each position, resulting in a feature map. Kernels are learned during training and play a key role in tasks like image recognition and object detection.
The kernel is applied to a small region of the input data at a time. It is typically square, with dimensions such as 3×3 or 5×5, although other sizes are possible.
Convolution operation, with an example
Suppose we have the following 1D input signal:
f(x)=[1,2,3,4,5]
and a 1D filter (also known as kernel or mask):
g(x)=[0.5,1,0.5]
To perform the convolution operation, we slide the filter g(x) across the input signal f(x) and compute the dot product at each position. The result at each position gives us the output of the convolution operation.
Here's how it works:
- Place the filter on top of the input signal:
- Initially, the leftmost element of the filter aligns with the leftmost element of the input signal.
- Compute the dot product:
- At each position, we compute the dot product between the filter and the overlapping portion of the input signal.
- For example, at the first position: f1⋅g1+f2⋅g2+f3⋅g3=1⋅0.5+2⋅1+3⋅0.5=4
- Move the filter:
- After computing the dot product at the current position, we shift the filter to the right by one position.
- Repeat:
- We repeat steps 2 and 3 until the filter has passed over the entire input signal.
The output of the convolution operation is a new signal, typically called the feature map or the output of the convolutional layer. In this example, with no padding, the filter fits at 5−3+1=3 positions, so the output feature map is:
Output=[4,6,8]
This is just a basic example of 1D convolution. In practice, convolution operations are performed in 2D (e.g., for image data) or even in higher dimensions. They are fundamental in convolutional neural networks (CNNs) for tasks such as image processing, feature extraction, and pattern recognition.
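The example above can be verified directly in NumPy (the symmetric kernel means cross-correlation and true convolution coincide here):

```python
import numpy as np

f = np.array([1, 2, 3, 4, 5], dtype=float)   # input signal
g = np.array([0.5, 1, 0.5])                  # filter / kernel

# Manual sliding window: dot product of the kernel with each overlapping region.
out = [float(np.dot(f[i:i + 3], g)) for i in range(len(f) - len(g) + 1)]
print(out)                                   # [4.0, 6.0, 8.0]

# The same result from NumPy's convolution in 'valid' mode.
print(np.convolve(f, g, mode="valid"))       # [4. 6. 8.]
```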
Pooling Layer
A pooling layer in a convolutional neural network (CNN) downsamples the feature maps, reducing their spatial dimensions while retaining important information. It operates with a sliding window over the feature map, applying operations like max pooling or average pooling to extract key features. Pooling helps control overfitting, reduces computational complexity, and introduces translation invariance, making the network more robust. However, it may lead to information loss and the loss of fine-grained details. Overall, pooling is essential for effective feature extraction and learning in CNNs.
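As a small illustration, here is a minimal NumPy sketch of 2×2 max pooling with stride 2 (the feature-map values are made up):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D feature map (H and W even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 5, 1],
                 [7, 2, 9, 8],
                 [0, 1, 3, 4]], dtype=float)
print(max_pool_2x2(fmap))
# [[6. 5.]
#  [7. 9.]]
```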
CNN Samoyed Example:
The description provided outlines the architecture of a convolutional neural network (CNN) used for distinguishing between images of Samoyed dogs and images of other dog breeds. Let's break down the components mentioned:
- Convolutional Layer:
- The first layer mentioned is a convolutional layer. This layer applies a set of learnable filters (also known as kernels) to the input image.
- Each filter convolves over the input image, computing a dot product between the filter weights and a small region of the input image.
- The output of this operation is a feature map that represents the presence of different patterns or features in the input image.
- The Rectified Linear Unit (ReLU) activation function is then applied element-wise to the feature map, introducing non-linearity to the network.
- Max Pooling Layer:
- The next layer mentioned is a max pooling layer. Max pooling is a downsampling operation that reduces the spatial dimensions of the feature maps while retaining the most important information.
- In a max pooling operation, a small window (typically 2x2 or 3x3) slides over the feature map, and only the maximum value within each window is retained.
- This helps in reducing the computational complexity of the network and making the learned features more robust to small translations and distortions in the input image.
- Repetition:
- The process of convolution followed by ReLU activation and max pooling is repeated multiple times in the network.
- Each repetition allows the network to learn increasingly complex and abstract features from the input image.
- As the network progresses deeper, the spatial dimensions of the feature maps decrease while the number of channels (or depth) typically increases.
By stacking multiple convolutional layers with ReLU activation and max pooling layers, the network can effectively learn hierarchical representations of the input image, starting from low-level features such as edges and textures and progressing to high-level features that are discriminative for distinguishing between different dog breeds, such as the presence of fur patterns, shapes of ears, and facial features specific to Samoyed dogs. This hierarchical feature learning process enables the CNN to effectively classify images of Samoyed dogs from images of other dog breeds.
Stochastic Gradient Descent (SGD):
Gradient descent is a fundamental optimization algorithm used to minimize a function by iteratively moving in the direction of the steepest descent of the function's gradient. It is commonly employed in machine learning and deep learning to optimize the parameters of a model, such as weights and biases, with respect to a given objective function, typically a loss function. Stochastic gradient descent (SGD) is the variant usually used in practice: instead of computing the gradient over the entire dataset, each update estimates it from a single example or a small random mini-batch, which is far cheaper per step.
Here's how gradient descent works:
- Initialization: The algorithm starts with an initial guess for the parameters of the model. These parameters could be randomly initialized or set to some predefined values.
- Compute Gradient: At each iteration, the gradient of the objective function with respect to the parameters is computed. This gradient represents the direction of the steepest increase of the function.
- Update Parameters: The parameters are then updated in the direction opposite to the gradient, scaled by a factor known as the learning rate. The learning rate determines the size of the steps taken during each iteration.
- Convergence: Steps 2 and 3 are repeated iteratively until a stopping criterion is met. This could be a fixed number of iterations, convergence of the objective function, or reaching a predefined threshold for the change in parameters.
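Putting the four steps together, here is a minimal NumPy sketch of stochastic gradient descent on a linear regression problem; the synthetic data and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)                  # 1. initialization
lr, batch_size = 0.1, 16

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # 2. gradient of the MSE loss
    w -= lr * grad                                # 3. step against the gradient
    # 4. in practice, stop on a fixed budget or when the loss plateaus

print(w)   # close to [1.0, -2.0, 0.5]
```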
linear classifiers are applied on top of hand-engineered features
Before deep learning, the standard approach in machine learning was to apply linear classifiers on top of hand-engineered features for practical applications. Here's a breakdown of the key points:
- Linear Classifiers: These are algorithms used for classification tasks where the decision boundary separating different classes is linear. Examples include logistic regression and linear support vector machines (SVM). Linear classifiers compute a weighted sum of the feature vector components, which is then used to make predictions.
- Hand-Engineered Features: These are features extracted from the input data using domain knowledge or manual feature engineering techniques. Hand-engineered features are designed to capture relevant information that is useful for the classification task. Examples of hand-engineered features include color histograms for image classification or word frequencies for text classification.
- Two-Class Classification: In many practical applications, the task involves classifying inputs into one of two classes or categories. For example, classifying emails as spam or not spam, or classifying images as containing cats or dogs. A two-class linear classifier computes the weighted sum of the feature vector components and compares it to a threshold. If the weighted sum is above the threshold, the input is classified as belonging to a particular category; otherwise, it is classified as belonging to the other category.
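A hedged sketch of that decision rule in Python; the feature values, weights, and threshold are made-up illustrations:

```python
import numpy as np

features = np.array([0.7, 0.1, 0.3])    # e.g., hand-engineered email statistics
weights = np.array([2.0, -1.5, 0.8])
threshold = 0.5

score = weights @ features              # weighted sum of feature components
label = "class A" if score > threshold else "class B"
print(score, label)                     # 1.49 class A
```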
backpropagate gradients
The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.
CNN vs RNN, summarized by ChatGPT 3.5
- CNNs are deep learning models commonly used for tasks such as image classification, object detection, and image segmentation. They consist of multiple layers of convolutional and pooling operations, followed by fully connected layers for classification.
- RNNs are deep learning models commonly used for tasks involving sequential data, such as natural language processing, time series analysis, and speech recognition. They have recurrent connections that allow them to maintain information over time and capture temporal dependencies in the data.
Comparison table between CNNs and RNNs with examples of algorithms for each:
| Feature | Convolutional Neural Networks (CNNs) | Recurrent Neural Networks (RNNs) |
| --- | --- | --- |
| Architecture | Designed for grid-structured data (e.g., images) | Designed for sequential data (e.g., text, time series) |
| Memory | No inherent memory mechanism | Internal memory mechanism to maintain information over time |
| Parallelism | Operations can be parallelized across spatial dimensions | Operations are inherently sequential |
| Applications | Image classification, object detection, image segmentation, etc. | Natural language processing, speech recognition, etc. |
| Long-Term Dependencies | Not well-suited for capturing long-term dependencies | Designed to capture long-term dependencies in sequential data |
Examples of algorithms:
- Convolutional Neural Networks (CNNs):
- LeNet
- AlexNet
- VGG
- GoogLeNet (Inception)
- ResNet
- MobileNet
- EfficientNet
- YOLO (You Only Look Once)
- SSD (Single Shot Multibox Detector)
- Recurrent Neural Networks (RNNs):
- Vanilla RNN
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- Bidirectional RNN
- Seq2Seq (Sequence-to-Sequence)
- Transformer (attention-based rather than recurrent, but used for the same sequential tasks)
- BERT (Bidirectional Encoder Representations from Transformers; likewise attention-based)
- WaveNet (a convolutional sequence model)
Related Mathematics Knowledge
Both partial derivatives and derivatives are concepts from calculus, and they both involve the rate of change of a function. However, they have slightly different applications and interpretations.
- Derivative:
- The derivative of a function represents the rate of change of that function with respect to a single variable.
- It measures how the function's output changes as the input variable changes.
- Symbolically, if y=f(x), then the derivative of f(x) with respect to x is denoted as dy/dx or f′(x).
- Geometrically, the derivative gives the slope of the tangent line to the curve of the function at a specific point.
- Partial Derivative:
- The partial derivative of a function represents the rate of change of that function with respect to one of its several variables, while holding all other variables constant.
- It is used for functions with multiple variables, where the rate of change may vary in different directions.
- Symbolically, if a function f depends on multiple variables, say x and y, then the partial derivative of f with respect to x is denoted as ∂f/∂x.
- Geometrically, the partial derivative gives the slope of the tangent line to the function's surface in the direction of the specified variable, with the other variables held fixed.
In summary, derivatives deal with functions of one variable and measure the rate of change with respect to that variable, while partial derivatives deal with functions of multiple variables and measure the rate of change with respect to one variable while holding others constant. Both concepts are fundamental in calculus and are extensively used in various fields of mathematics, science, and engineering.
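As a quick worked example: if f(x, y) = x²y, then ∂f/∂x = 2xy (treating y as a constant) and ∂f/∂y = x² (treating x as a constant), whereas a single-variable function such as g(x) = x³ simply has the ordinary derivative g′(x) = 3x².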
Reference
Question 3: What are the concepts: Loop, iteration, and recursion in programming?
- Loop = repeat
- Iteration = one by one
- Recursion = call itself
Question 4: i++
| Operator | Description | Example (assuming i is initially 5) |
| --- | --- | --- |
| i++ | Post-increment: increment i after evaluation | i++ evaluates to 5, then sets i to 6 |
| i-- | Post-decrement: decrement i after evaluation | i-- evaluates to 5, then sets i to 4 |
| ++i | Pre-increment: increment i before evaluation | ++i sets i to 6, then evaluates to 6 |
| --i | Pre-decrement: decrement i before evaluation | --i sets i to 4, then evaluates to 4 |
Question 5: What is the norm of a vector? What is p-norm?
In mathematics, the norm of a vector is a measure of its magnitude or length. It is a function that assigns a non-negative value to every vector in a vector space, such that the norm of a vector is zero if and only if the vector is the zero vector, and that satisfies certain properties such as positive homogeneity, the triangle inequality, and subadditivity. The p-norm of a vector x = (x1, …, xn) is ‖x‖p = (|x1|^p + … + |xn|^p)^(1/p) for p ≥ 1: p = 1 gives the Manhattan norm, p = 2 the Euclidean norm, and the limit p → ∞ gives the maximum norm.
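As a quick check of these definitions in NumPy:

```python
import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, ord=1))       # 7.0  (Manhattan / 1-norm)
print(np.linalg.norm(x, ord=2))       # 5.0  (Euclidean / 2-norm)
print(np.linalg.norm(x, ord=np.inf))  # 4.0  (max norm, the p -> infinity limit)
```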
Question 6: How to write f(x) = ax² + bx + c, x ∈ [x₁, x₂] in a matrix way?
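One common reading of the question: pull the coefficients out into a vector, so that f(x) = [x², x, 1]·[a, b, c]ᵀ. Evaluating f at sample points t₁, …, tₙ ∈ [x₁, x₂] then stacks these rows into a Vandermonde-style matrix equation, [f(t₁), …, f(tₙ)]ᵀ = T[a, b, c]ᵀ, where row i of T is [tᵢ², tᵢ, 1].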
Semantic Segmentation
Semantic segmentation is a computer vision task where the goal is to classify each pixel in an image into a predefined set of categories. Here's a step-by-step explanation of the process:
- Labeling Your Data:
- Semantic segmentation requires labeled data where each pixel in an image is annotated with the class it belongs to (e.g., road, car, pedestrian). This labeling process is often done manually by humans, although there are some automated methods available.
- Create Two Datastores:
- Image Datastore: This contains the input images for training and testing.
- Pixel Label Datastore: This contains the corresponding labeled images where each pixel is labeled with its class.
- Partition Datastores:
- After creating the datastores, you split them into separate subsets for training and testing. This ensures that the model is trained on one set of data and evaluated on another set to assess its generalization performance.
- Import Pre-trained Model and Modify to SegNet:
- Pre-trained models, especially in deep learning, are often used as a starting point due to their learned representations. These models are imported and adapted for semantic segmentation. SegNet is one such architecture specifically designed for semantic segmentation.
- The pre-trained model's architecture is adjusted to suit the task of semantic segmentation. This typically involves modifying the last few layers of the network to output pixel-wise class predictions instead of some other task like image classification.
- Train and Evaluate:
- With the modified model and partitioned data, the training process begins. During training, the model learns to predict the correct class labels for each pixel in the input images.
- After training, the model's performance is evaluated using the test data. This evaluation usually involves metrics such as pixel accuracy, mean Intersection over Union (IoU), or F1-score to quantify how well the model performs at segmenting objects in unseen images.
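A minimal sketch of the mean-IoU metric mentioned above, assuming integer class labels per pixel in pred and truth (arrays of the same shape):

```python
import numpy as np

def mean_iou(pred, truth, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, truth == c).sum()
        union = np.logical_or(pred == c, truth == c).sum()
        if union > 0:                       # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1], [0, 1, 1]])
truth = np.array([[0, 0, 1], [1, 1, 1]])
print(mean_iou(pred, truth, num_classes=2))   # (2/3 + 3/4) / 2 ≈ 0.708
```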
Lecture 3: Artificial Neural Network
Vocabulary
- 神经网络直觉 (Neural networks intuition)
- 随机梯度下降 (Stochastic gradient descent)
- 人工神经元 (Artificial neuron)
- 感知器 (Perceptron)
- Sigmoid 神经元 (Sigmoid neuron)
- 做出相当微妙的决定 (Make quite subtle decisions)
- 更抽象的层面 (More abstract level)
- 复杂的决策制定 (Sophisticated decision making)
- 偏差 (Bias)
- 繁琐的 (Cumbersome)
- 进行两次符号变更 (Make two notational changes)
- 点乘 (Dot product)
- 与非门 (NAND gate)
- 明确表示乘法 (Multiplications explicit)
- 按位求和 (Bitwise sum)
- 代数形式 (Algebraic form)
- 灰度图像(greyscale image)
- 最终的理由是经验主义(The ultimate justification is empirical)
1. What are ANN, CNN, and DNN in this course?
ANN (Artificial Neural Network), CNN (Convolutional Neural Network), and DNN (Deep Neural Network) are concepts in deep learning.
- ANN: Basic neural network model that simulates the human brain's neuron connections. It's composed of interconnected nodes, used for tasks like classification and regression.
- CNN: Specialized for grid-like data like images, it consists of convolutional and pooling layers. Great for computer vision tasks due to its ability to extract features efficiently.
- DNN: Refers to neural networks with multiple hidden layers. It's a broader term encompassing various architectures, including CNN. DNNs can learn complex features for various tasks.
2. What are Perceptron, Sigmoid, and MLP?
Perceptron, Sigmoid, and MLP (Multi-Layer Perceptron) are different concepts in the field of neural networks, and they have certain relationships:
- Perceptron:
- Perceptron is one of the simplest artificial neuron models, proposed by American psychologist Frank Rosenblatt in 1957.
- The perceptron takes inputs and produces an output, which is the result of the weighted sum of inputs passed through a threshold function (such as a step function).
- Sigmoid Function:
- The sigmoid function is a commonly used activation function, typically employed in the hidden layers and output layer of neural networks.
- Its output ranges from 0 to 1, exhibiting continuity and differentiability, enabling effective parameter updates using optimization algorithms like gradient descent.
- The sigmoid function can map the output of perceptrons to the (0, 1) range, providing a nonlinear transformation.
- MLP (Multi-Layer Perceptron):
- MLP is a neural network structure consisting of multiple layers of neurons, typically including input layer, multiple hidden layers, and output layer.
- In MLP, each neuron is connected to all neurons in the previous layer, and each connection has a weight.
- The sigmoid function is commonly used as the activation function in the hidden layers and output layer of MLP, enabling the network to learn nonlinear features and relationships.
Perceptrons, Sigmoid neurons, and MLPs are all examples of feedforward neural networks. They follow a strict forward flow of information without any feedback loops.
3. Perceptrons
The perceptron is a basic neural network model.
A perceptron takes several binary inputs, x1,x2,… , and produces a single binary output.
Weights are introduced, w1,w2,… , real numbers expressing the importance of the respective inputs to the output.
4. Sigmoid Neuron
Sigmoid neurons have much of the same qualitative behaviour as perceptrons, but they make it much easier to figure out how changing the weights and biases will change the output.
Because small changes in the weights and biases produce an approximately linear change in the output, it is easy to choose small changes in the weights and biases to achieve any desired small change in the output.
5. Multilayer Perceptrons
Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons.
6. What’s in the input layer
We'll use the notation x to denote a training input. It'll be convenient to regard each training input x as a 28×28=784 dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by y=y(x), where y is a 10 dimensional vector. For example, if a particular training image, x, depicts a 6, then y(x)=(0,0,0,0,0,0,1,0,0,0)T is the desired output from the network. Note that T here is the transpose operation, turning a row vector into an ordinary (column) vector.
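A small NumPy sketch of this encoding (the blank image is a placeholder):

```python
import numpy as np

image = np.zeros((28, 28))        # placeholder grey values for one training image
x = image.reshape(784)            # the 784-dimensional input vector

digit = 6
y = np.zeros(10)
y[digit] = 1.0                    # (0,0,0,0,0,0,1,0,0,0)^T, the desired output
print(x.shape, y)
```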
Q & A From Lecture 3
(1) The goal during training is to minimize the loss function, and the choice of loss function depends on the type of problem, such as regression, classification, etc.
The Mean Squared Error (MSE) function is the most common choice for regression models, which are used for tasks like predicting stock prices, diagnosing diseases, or forecasting sales:
MSE = (1/n) * Σ(y_i - f(x_i))^2
The most common choice for classification is the cross-entropy loss:
Cross Entropy Loss = -Σ(y_i * log(p_i))
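Both formulas transcribe directly into NumPy (natural log for the cross entropy; the values are illustrative):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, p):
    eps = 1e-12                   # avoid log(0)
    return -np.sum(y * np.log(p + eps))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))                 # 0.25
print(cross_entropy(np.array([0, 1, 0]), np.array([0.1, 0.8, 0.1])))   # ~0.223
```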
(2) A neuron in a multilayer neural network acts as a simple processing unit. It receives signals from the previous layer as its inputs, multiplies each input by a weight, sums those weighted values, and passes the result through an activation function to the following layer. The purpose of the activation function is to introduce non-linearity into the network; without it, stacking multiple layers would collapse into a single linear transformation.
There are four common activation functions used in different situations:
- Sigmoid: It squashes the output to a range between 0 and 1. It's useful in binary classification problems where the output needs to be interpreted as a probability. But because its derivative lies between 0 and 0.25, gradients shrink as they pass backward through many layers, a failure known as the vanishing gradient problem, and the initial input can hardly affect the final output.
- Hyperbolic Tangent (tanh): Similar to the sigmoid function, but squashes the output to a range between -1 and 1. It is often preferred over the sigmoid because it tends to make the distribution of outputs more centered around zero.
- Rectified Linear Unit (ReLU): It returns zero for negative inputs and increases linearly for positive inputs. It's one of the most popular activation functions because of its simplicity and effectiveness in training deep neural networks.
- Leaky ReLU: Similar to ReLU, but it allows a small, non-zero gradient for negative inputs, which helps with the vanishing gradient problem: for a negative input, it returns the input multiplied by a tiny value such as 0.01, so the output stays slightly negative instead of being clamped to zero.
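For reference, the four functions as NumPy one-liners:

```python
import numpy as np

def sigmoid(x):  return 1 / (1 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def relu(x):     return np.maximum(0, x)
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))      # [0.119 0.5   0.881]
print(tanh(z))         # [-0.964  0.     0.964]
print(relu(z))         # [0. 0. 2.]
print(leaky_relu(z))   # [-0.02  0.    2.  ]
```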
(3) The backpropagation algorithm is based on the chain rule, d[f(g(x))] / dx = f'(g(x)) * g'(x): a tiny change in the weight parameters produces a correspondingly small change in the output f(g(x)). Training the model, which is also the process of minimizing the loss function, repeats these tiny changes in the direction that lowers the loss; when we reach the minimum of the loss function, we have the best parameters and the model is ready. When each update uses a gradient estimated on a randomly chosen subset of the data, this procedure is called Stochastic Gradient Descent (SGD).
(4) The confusion matrix is used to evaluate a pattern classification model; by aiming for higher recall or higher precision, e.g., by adjusting the β value in the Fβ-score, it supports different classification standards and trade-offs. For example, if you want to find a boyfriend at university and you are very cautious, not wanting to meet a bad guy, you should aim for high precision: make sure the candidates in the results are highly trustworthy. But if you are a police officer trying to find the killer behind a body floating in the sea, you should aim for high recall: you can afford to investigate the wrong person, but you don't want the killer to slip away.
ANN Toolbox Examples:
Wine Classification
- dataset preparation: The 178 wine samples are automatically divided into training, validation and test sets.
- The input X is a 13 × 178 matrix: 178 columns for the 178 wine samples and 13 rows for 13 attributes. I guess these 13 attributes are features of the wines, like nose, palate, finish, color, spice level, aroma, and so on. For the values I would guess, for example, that nose might be rated 1-5, representing five levels. This needs expert knowledge, or so-called auxiliary information.
- The output T is a 3 × 178 matrix: 178 columns for the results of the 178 wine samples, and 3 rows indicating which of the 3 wineries the wine belongs to. The value is 1 if it belongs to that winery and 0 if not.
- The neural network has a single hidden layer of 10 neurons, which means:
- there are 10 sets of "13 weights and 1 bias" from the input to the hidden layer: one set per hidden neuron, and each set can be different.
- there are 3 sets of "10 weights and 1 bias" from the hidden layer to the 3 output neurons.
Training and validation: how does this actually run? To be tried once I can get onto MATLAB.
- performance evaluation:
plotperform: not fully understood yet; needs a run in MATLAB
- plotconfusion(testT,testY)
- plotroc(testT,testY)
Cancer Detection
Just the same as the wine classification: the columns represent samples (patients), and the rows are features.
Each row in Y represents the ion intensity level at a specific mass-charge value indicated in MZ. There are 15000 mass-charge values in MZ, and each row in Y represents the ion-intensity levels of the patients at that particular mass-charge value.
The general neural network design process:
- Collect Data: Gather relevant data that will be used to train and test the neural network. This data should be representative of the problem domain and should include both input features and corresponding target labels or outcomes.
- Create the Network: Design the architecture of the neural network, including the number of layers, the types of layers (e.g., fully connected layers, convolutional layers, recurrent layers), and the activation functions used in each layer. This step involves deciding on the overall structure of the network based on the characteristics of the data and the complexity of the problem.
- Configure the Network: Specify the hyperparameters of the neural network, such as the learning rate, batch size, optimization algorithm, and regularization techniques. These hyperparameters control the training process and can significantly impact the performance of the network.
- Initialize the Weights and Biases: Initialize the weights and biases of the network with small random values or using predefined initialization techniques. Proper initialization of these parameters is crucial for ensuring that the network can learn effectively during training.
- Train the Network: Train the neural network using the training data collected in step 1. During training, the network iteratively adjusts its weights and biases based on the input-output pairs in the training data to minimize a chosen loss function. This process typically involves forward propagation to compute predictions, backpropagation to compute gradients, and optimization algorithms to update parameters.
- Validate the Network (Post-training Analysis): Evaluate the performance of the trained network on a separate validation dataset to assess its generalization ability. This step involves analyzing metrics such as accuracy, loss, precision, recall, and F1-score to determine how well the network performs on unseen data and whether it has overfit or underfit the training data.
- Use the Network: Once the network has been trained and validated, it can be deployed for making predictions or performing tasks on new, unseen data. The trained network can be integrated into applications, systems, or pipelines to provide solutions to real-world problems. Additionally, the network's performance can be continuously monitored and improved over time as more data becomes available or as the problem requirements change.
Wine Classification with Pre-defined Neural Network
Lecture 4: Convolutional Neural Network
Convolution: the product of two functions, summed with weights; in the continuous limit it becomes a weighted integral.
The convolution kernel is typically a 3×3 matrix. The kernel is the flipped g function, so during the operation we can directly take elementwise products and sum them.
Different convolution kernels extract different features from an image.
An image is essentially pixel positions together with grayscale or RGB values.
How the extracted features are then used is a separate question.
Why a 3×3 matrix? Doesn't such a computation drag the surrounding pixels in as well? That is exactly the point: we want the influence of the neighboring pixels on the pixel at that position.

Concepts of CNN
- Receptive field: A region of the original image corresponding to a pixel of the feature map of a filter or kernel
- Feature map: The output of convolution operations
- Stride: The step length of convolution operations
- Fsize: The size of convolution kernels or filters
- Padding: The filled region of an image boundary
- Top-down: from a deep layer to its next layer
- Anchor boxes: pre-defined bounding boxes help algorithms detect objects at various scales and aspect ratios
Layers in CNN
Three of the most common layers are convolution, activation or ReLU, and pooling.
- Convolution puts the input images through a set of convolutional filters, each of which activates certain features from the images.
- Rectified linear unit (ReLU) allows for faster and more effective training by mapping negative values to zero and maintaining positive values. This is sometimes referred to as activation, because only the activated features are carried forward into the next layer.
- Pooling simplifies the output by performing nonlinear downsampling, reducing the number of parameters that the network needs to learn.
These operations are repeated over tens or hundreds of layers, with each layer learning to identify different features.
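As one way to make the stack concrete, here is a minimal PyTorch-style sketch of the conv → ReLU → pool pattern repeated twice, followed by a classifier head; the channel counts, input size, and class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution
    nn.ReLU(),                                   # activation
    nn.MaxPool2d(2),                             # pooling (downsample by 2)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # assumes 32x32 inputs, 10 classes
)

x = torch.randn(1, 3, 32, 32)    # one RGB image
print(model(x).shape)            # torch.Size([1, 10])
```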
Shared Weights and Biases
- Unlike a traditional neural network, a CNN has shared weights and bias values, which are the same for all hidden neurons in a given layer.
- This makes the network tolerant to translation of objects in an image. For example, a network trained to recognize cars will be able to do so wherever the car is in the image.
Classification Layers
R-CNN (Region-based CNN) is a two-stage detection algorithm:
- The first stage identifies a subset of regions in an image that might contain an object.
- The second stage classifies the object in each region.
- The Faster R-CNN detector adds a region proposal network (RPN) to generate region proposals directly in the network instead of using an external algorithm like Edge Boxes.
SSD: Single Shot MultiBox Detector
- SSD is similar to the Faster R-CNN and simultaneously produces a score for each object in each box.
- SSD skips the proposal step and predicts bounding boxes and confidence scores for multiple classes.
- SSD uses default bounding boxes of different aspect ratios on each location from multiple feature maps.
YOLO
- YOLO sees the entire image during training and testing time so that it encodes contextual information of all classes.
- YOLO predicts which objects are present and where they are.
- A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.
Instance Segmentation
Instance segmentation is an enhanced type of object detection that generates a segmentation map for each detected instance of an object. Instance segmentation treats individual objects as distinct entities, regardless of the class of the objects. In contrast, semantic segmentation considers all objects of the same class as belonging to a single entity.
Lecture 5
Vocabulary
- residual connections (剩余链接): also called shortcut connections in ResNet
- saturated (饱和的)
- degradation problem (降解问题)
Degradation Problem
In deep learning, as the network depth increases, accuracy gets saturated and then degrades rapidly; this is the degradation problem.
The behavior of activation functions and their derivatives directly influences how gradients propagate through the network. Activation functions with very small derivatives can cause gradients to shrink significantly during backpropagation, leading to the vanishing gradient problem. Conversely, activation functions with very large derivatives can cause gradients to explode, leading to the exploding gradient problem.
DenseNets
- To preserve the feedforward nature, DenseNets use direct connections from any layer to all subsequent layers.
- DenseNets alleviate the vanishing gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
- Blocks: each layer takes all preceding feature maps as input.
ResNets
- ResNets are easy to optimize and address the degradation problem.
- ResNets can easily enjoy accuracy gains from greatly increased depth.
- Shortcut connections are inserted to convert a plain network into a ResNet.
- The identity shortcuts can be used directly if the input and output have the same dimensions.
Matlab Deep Neural Network
- AlexNet is a convolutional neural network trained on more than a million images from the ImageNet database.
- AlexNet is eight layers deep and can classify images into 1,000 object classes, such as keyboard, mouse, pencil, and many animals.
- VGG-19 is a convolutional neural network trained on more than a million images from the ImageNet database.
- GoogLeNet is a pretrained convolutional neural network that is 22 layers deep.
- Inception-v3 is a convolutional neural network trained on more than a million images from the ImageNet database.
Siamese Neural Network
- Pairwise similarity: the network is trained so that the similarity of two samples from the same class approaches 1, e.g., with a loss that penalizes how far sim(x1, x2) is from 1 for same-class pairs.
- Triplet loss: the anchor-positive distance should be as small as possible, and the anchor-negative distance as large as possible.
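One standard form of the triplet loss is L = max(d(anchor, positive) − d(anchor, negative) + margin, 0); a hedged NumPy sketch with Euclidean distance and illustrative 2-D embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_ap = np.linalg.norm(anchor - positive)   # smaller is better
    d_an = np.linalg.norm(anchor - negative)   # bigger is better
    return max(d_ap - d_an + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same class as the anchor
n = np.array([2.0, 0.0])   # different class
print(triplet_loss(a, p, n))   # max(0.1 - 2.0 + 1.0, 0) = 0.0
```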
Supervised Learning
- Test samples are never seen during training.
- Test samples are from the classes seen during training.
Few-Shot Learning
- Test samples are never seen during training.
- Test samples are not from the classes seen during training.
- The class of the ‘query set’ is in the ‘support set’.
- k-way means the number of classes in the support set; accuracy decreases as k increases.
- n-shot means the number of samples per class (0-shot, 1-shot, 2-shot, ...); accuracy increases as n increases.
Lecture 6
In-context learning
Few-shot learning is a general machine learning approach that uses parameter adaptation to learn the best model parameters for the task from a limited number of supervised examples (Wang and Yao, 2019). In contrast, in-context learning does not require parameter updates and is performed directly on pretrained LLMs.
Multi-domain Few-shot Learning:
In single-domain few-shot learning, the support set comprises cats and dogs, and the query image is classified as either a cat or a dog. This scenario remains within a single domain due to the similarity between cats and dogs. However, if we include additional domains such as airplanes and ships alongside cats and dogs, it becomes a multi-domain setting. One of the key aspects of multi-domain few-shot learning is leveraging the correlations within each domain.
Lecture 7 LSTM
1. Capabilities of LSTM Networks:
- Learning Long-Term Dependencies: Unlike standard RNNs, which struggle with vanishing and exploding gradients, LSTMs are designed to remember information for long durations. This makes them excellent at learning from data where past information is key to understanding the future.
- Handling Sequences of Variable Length: LSTMs can process input sequences of varying lengths, which is beneficial for many types of data such as text, speech, and time series.
- Working With Time-Series Data: They can predict future values in a sequence, making them ideal for forecasting tasks in finance, weather, and other fields.
- Sequence Classification: LSTMs can classify entire sequences into categories. This is useful in sentiment analysis, video classification, and other applications where the entire input sequence is used to determine a single output.
- Sequence Generation: They can generate new sequences that are similar to those they have been trained on. This ability is used in text generation, music composition, and chatbots.
- Sequence to Sequence Learning (Seq2Seq): LSTMs can be used for tasks that require mapping input sequences to output sequences. This is common in machine translation, speech recognition, and other natural language processing tasks.
2. Common Applications of LSTM Networks:
- Natural Language Processing (NLP):
- Text Generation: Generating coherent and contextually relevant text based on a given input.
- Machine Translation: Translating text or speech from one language to another.
- Sentiment Analysis: Determining the sentiment expressed in a piece of text.
- Speech Recognition: Transcribing spoken language into text. LSTMs can handle the temporal dynamics of speech which makes them suitable for this task.
- Music Generation: Creating new pieces of music that mimic the style of the training data.
- Financial Forecasting:
- Predicting stock prices, market trends, and economic indicators based on historical data.
- Risk assessment and anomaly detection in financial transactions.
- Healthcare:
- Patient monitoring systems that predict changes in patient condition based on medical records.
- Analysis of medical imaging data.
- Video Processing:
- Action recognition in videos where understanding the sequence of frames is crucial.
- Anomaly detection in surveillance video.
3. LSTM Practice: Sequence Classification Using Deep Learning https://au.mathworks.com/help/deeplearning/ug/classify-sequence-data-using-lstm-networks.html
1. Why Use LSTM to Classify Waveforms
LSTMs are particularly advantageous for classifying waveforms due to their ability to capture long-term dependencies in time-series data. Waveforms, like those in audio, electrical signals, or any periodic data, often have patterns that unfold over time, and understanding the entire sequence is crucial for accurate classification. LSTMs excel in such scenarios because they:
- Retain information from earlier in the sequence, which is essential for recognizing entire patterns.
- Handle variations in the length of sequences effectively, a common characteristic in waveform data.
2. Necessity of Data Sorting
Sorting data by sequence length before batching is crucial for efficient training:
- Minimize Padding: Sorting reduces the variability of sequence lengths within each batch, minimizing the amount of padding required and hence reducing computational waste.
- Enhance Computational Efficiency: With less padding, each training step is more efficient, as the network spends less time processing zero-padded data.
Sorting is only needed so that the sequences within each mini-batch have similar lengths: the sequences in a batch are padded to the length of the longest one, because the matrix operations used in training the neural network require equal-length inputs within a batch.
3. Mini-Batch and Epoch Explanation
- Mini-Batch: Refers to processing a subset of the training data at one go. This approach balances the training speed and memory efficiency, and can help the model generalize better by providing a stochastic estimation of the gradient.
- Epoch: Refers to one complete pass through the entire training dataset, including all batches. Using multiple epochs allows the model to learn incrementally, refining its weights as it sees the data multiple times.
4. Choosing Different Output Modes
The choice of OutputMode in an LSTM layer depends on the specific requirements of the task:
- "last": Used when only the final state is needed for a prediction, such as in sentiment analysis or any other form where the entire sequence is summarized into a single output.
- "sequence": Used when an output is required for each time step of the input, like in time-series forecasting or tagging tasks (e.g., part-of-speech tagging).
- "none": This might be used in complex architectures where the LSTM outputs are not directly used but are passed through other transformations or layers.
5. When to Use BiLSTM vs. LSTM
- BiLSTM (Bidirectional LSTM): Should be used when the context from both past and future is important for understanding the sequence at any point. This is particularly useful in tasks like speech recognition or text processing where the subsequent inputs can provide relevant context for interpreting the current input.
- LSTM: More suited when only the previous context is necessary, or when the future input should not influence the current output. This is typical in real-time applications such as live audio processing or online prediction tasks where future inputs are not available.
4. LSTM Practice: Classify Videos Using Deep Learning
1. Which layer should be used to extract features?
1. Type of Network
- Convolutional Neural Networks (CNNs): For image and video data, features from different layers of a CNN capture various aspects of the data:
- Early Layers: Typically capture basic features such as edges, colors, and textures. These features are more generic.
- Middle Layers: Start to assemble more complex features that represent parts of objects in images, such as corners or specific textures.
- Deep Layers: Capture high-level features that often correspond to specific objects or scenes. These features are more abstract and task-specific.
2. Task Complexity
- Simple Tasks: For less complex tasks or when the objects of interest are common and not deeply embedded within the scene, features from earlier or middle layers may suffice as they provide more generic visual descriptors.
- Complex Tasks: For more complex tasks, such as distinguishing between highly similar categories or detailed scene analysis, deeper layers are preferable as they encode more abstract and detailed aspects of the input data.
3. Experimentation and Validation
- Empirical Testing: Often, the best way to determine the most effective layer for feature extraction is through empirical testing. This involves experimenting with features from different layers and evaluating their impact on the performance of the specific task.
- Cross-Validation: Use cross-validation to assess the robustness of the chosen features across different subsets of your data to ensure generalizability.
4. Specific Characteristics of Layers
- Fully Connected Layers: In many networks, fully connected layers (towards the end of the network) are used to make final predictions and may represent too specific features that are closely tailored to the original training task of the network.
- Pooling Layers: Features from pooling layers are often more robust and invariant to the exact positioning of features within the image, which can be useful for tasks requiring recognition of objects regardless of their specific location in the image.
5. Using Pre-trained Models
- Model-Specific Literature: Look for academic papers, benchmarks, and community forums discussing the specific model you are using. Insights from these sources can provide valuable information on which layers have been found most useful for tasks similar to yours.
- Frameworks and Tools: Use tools and frameworks that allow for easy experimentation with different layers. Many deep learning frameworks, like TensorFlow and PyTorch, provide utilities to access intermediate layers of pre-trained models effortlessly.
Practical Example:
- Image Classification: Deeper convolutional layers (just before the fully connected layers) might be best as they capture the essence of the images that are crucial for distinguishing between complex categories.
- Object Detection and Localization: Features from the last convolutional layers are often used because they provide a good balance between high-level semantics and detailed spatial information, which is essential for locating objects.
2. How to use temp dir and data
Explanation of Key Parts:
- Temporary File Storage: The tempFile is set to a path in the tempdir, which is the system's designated temporary directory. This ensures that the data does not permanently occupy disk space and can be cleared easily.
- Existence Check: The script checks if the file exists using exist(tempFile, 'file'). If it does, it loads the data directly, skipping the processing steps, which saves time especially when working with large datasets.
- Data Processing and Saving: If the data does not exist, the videos are read, cropped, processed through the neural network to extract features, and the resulting data is saved in the temporary file for future use.
- Usage: After loading or processing, the sequences data, which contains feature vectors for each video, is ready for further analysis or training tasks.
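The same load-or-compute caching pattern, sketched in Python rather than MATLAB; the compute_features() helper is hypothetical and stands in for the slow video-feature extraction step:

```python
import os
import tempfile
import numpy as np

def compute_features():
    """Hypothetical stand-in for the slow video-feature extraction step."""
    return np.random.rand(10, 2048)   # e.g., 10 videos x 2048-dim features

cache_file = os.path.join(tempfile.gettempdir(), "video_features.npy")

if os.path.exists(cache_file):
    sequences = np.load(cache_file)   # reuse the cached features
else:
    sequences = compute_features()    # expensive step, run only once
    np.save(cache_file, sequences)    # cache the result for future runs
```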
3. Assemble CNN and the LSTM
Here's how folding and unfolding typically work in a neural network designed to process video data:
- Sequence Input: A sequence of video frames enters the network.
- Initial CNN Processing: Early CNN layers process each frame for basic feature extraction.
- Sequence Folding: These features are then "folded," allowing deeper CNN layers (which don’t handle sequences) to process each frame's features as if they were part of a batch of unrelated images.
- Further CNN Processing: Additional CNN layers apply more complex transformations to these features.
- Sequence Unfolding: The unfolding operation then rearranges these features back into a sequence format.
- LSTM Processing: Finally, LSTM layers analyze the sequence of features to capture and interpret temporal patterns and relationships.
Sequence Folding Layer
The sequence folding layer is used when you want to apply a CNN to each frame of a video independently while still handling the video as a sequence of frames. Here is how it works:
- Purpose: the folding layer converts a sequence of feature maps (the output of each frame after the CNN layers) into a batch of feature maps that can be processed as independent images. Essentially, it reorganizes the data, temporarily collapsing the time dimension (time or sequence order) so that the CNN layers can treat each frame like a separate image.
- Process: it takes the outputs of a sequence of images (each already processed by the CNN's initial layers) and rearranges those outputs into a 2-D format. This lets subsequent CNN layers, which do not natively handle sequences, process the data. At this stage, the features of each frame are processed independently.
Sequence Unfolding Layer
After the CNN has processed each frame independently, you need to restore the sequence structure so that an LSTM can analyze the temporal dynamics. This is where the sequence unfolding layer comes in:
- Purpose: the unfolding layer restores the time dimension by undoing the transformation made by the folding layer. Re-establishing this sequence format is necessary for the LSTM layers, which need to process the data as sequences in order to capture temporal relationships and dynamics.
- Process: it takes the batch of processed feature maps (from the CNN) and reorganizes them back into the original sequence format. Each element of the sequence now represents the CNN-processed features of one frame, ready to be fed into the LSTM layers.
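A sketch of the fold → CNN → unfold → LSTM pattern expressed in PyTorch, as one possible way to write it (all sizes are illustrative): the time dimension is merged into the batch so the CNN sees independent frames, then restored for the LSTM.

```python
import torch
import torch.nn as nn

B, T, C, H, W = 2, 16, 3, 32, 32          # batch, frames, channels, height, width
frames = torch.randn(B, T, C, H, W)

cnn = nn.Sequential(
    nn.Conv2d(C, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # one 8-dim feature vector per frame
)
lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)

folded = frames.view(B * T, C, H, W)      # "fold": frames become a plain batch
feats = cnn(folded)                       # per-frame features, shape (B*T, 8)
unfolded = feats.view(B, T, -1)           # "unfold": restore the sequence axis
out, _ = lstm(unfolded)                   # temporal modeling over the frames
print(out.shape)                          # torch.Size([2, 16, 32])
```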
- Author:wenyang
- URL:https://www.wenyang.xyz/article/deepl
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!