III. METHODOLOGY
In the first state, the MAC units of the first layer are active. The input pixels are multiplied by their respective weights, and the resulting weighted sum is passed through LUTs to derive H and Hbar simultaneously for all 32 neurons.
In the second state, the H and Hbar values are sequentially multiplied by the output layer's weights, proceeding from 0 to 31. The resulting weighted sum is then fed through LUTs to derive O and Obar for each of the 10 output neurons. At the end of this state, the error2 of the output layer is evaluated using 10 readily available subtractors, and the delta2 values are determined using 10 multipliers. These operations are combinational and do not require any additional clock cycle.
In the third state, the output layer registers are serially loaded with weights from the weight2 memory. This enables the product of the learning rate, delta2, and H to be added to each weight, and the weight is then written back to the memory.
In the fourth state, the hidden layer error is calculated by multiplying delta2 and the weights in sequence. Error1 for the present counter value is the sum of all of these partial products in a single cycle; to accomplish this, an adder with 10 operands is required. At the end of this state, the delta1 values for all 32 neurons are calculated from the available error1 values.
In the fifth state, the input pixel is multiplied by the shifted delta1 value, and in one cycle the 32 weights of the hidden layer are updated. This step is similar to the third state, in which the output layer weights are updated.
A. Clock Cycles
For an arbitrary network (M x N x K x L), the clock cycles and number of computations can be generalized as follows. If we consider a network with dimensions 784 x 32 x 10, the number of cycles required for forward propagation is 816, which can be generalized as M+N+K. For back propagation, the number of cycles required is 848, which can be generalized as K + (K+N) + (N+M). Therefore, the total number of cycles required is M+N+K + K + (N+K) + (N+M).
B. Computational Units
In a 784 x 32 x 10 network, there are 42 MAC units, 10 subtractors, one adder with ten operands, 42 multipliers, and 42 LUTs. Generalizing for a network with dimensions M x N x K x L, the required number of MAC units, multipliers, and LUTs would be N+K+L, the required number of subtractors would be L, and the required number of adders with N operands would be one, assuming N is greater than K and L.
The process of training and testing on the MNIST dataset begins by loading the weight memory and the input memory and initializing parameters such as the number of hidden layers, the number of neurons in each layer, the activation function for each neuron, and the number of epochs. The next step is to load an image to be trained and perform forward propagation. The weights are then updated, followed by an error calculation. This sequence, from loading the image to updating the weights, is repeated for 100 training images.
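The five-state sequence and the per-image training flow can be summarized in a short behavioral model. The sketch below is a minimal NumPy restatement of that flow, not the RTL: the sigmoid LUTs, fixed-point MAC units, and the shift-based learning-rate scaling are approximated in floating point; Hbar and Obar are assumed to denote 1-H and 1-O; and the random inputs and labels are placeholders for the MNIST images held in the input memory.

import numpy as np

M, N, K = 784, 32, 10            # input pixels, hidden neurons, output neurons
LEARNING_RATE = 2 ** -4          # the hardware scales by a shift, so a power of two is used here

def sigmoid(x):
    # Software stand-in for the sigmoid LUTs of the hardware.
    return 1.0 / (1.0 + np.exp(-x))

def train_one_image(pixels, target, weight1, weight2):
    # State 1: first-layer MACs, LUT -> H and Hbar for all N hidden neurons.
    H = sigmoid(weight1 @ pixels)
    Hbar = 1.0 - H
    # State 2: H times the output-layer weights, LUT -> O and Obar; error2 via
    # K subtractors and delta2 via K multipliers (combinational, no extra cycle).
    O = sigmoid(weight2 @ H)
    Obar = 1.0 - O
    error2 = target - O
    delta2 = error2 * O * Obar
    # State 3: learning_rate * delta2 * H added to the output-layer weights,
    # which are written back to the weight2 memory.
    weight2 += LEARNING_RATE * np.outer(delta2, H)
    # State 4: error1 per hidden neuron as the sum of K partial products
    # (the K-operand adder), then delta1 for all N hidden neurons.
    error1 = weight2.T @ delta2
    delta1 = error1 * H * Hbar
    # State 5: hidden-layer weights updated from the input pixels and the
    # learning-rate-scaled delta1 values, mirroring state 3.
    weight1 += LEARNING_RATE * np.outer(delta1, pixels)
    return weight1, weight2

# Overall flow: initialize the weight memories, then repeat forward
# propagation, weight update, and error calculation for 100 training images.
rng = np.random.default_rng(0)
weight1 = rng.normal(0.0, 0.1, (N, M))      # hidden-layer weight memory
weight2 = rng.normal(0.0, 0.1, (K, N))      # output-layer weight memory
for _ in range(100):
    pixels = rng.random(M)                  # placeholder for one loaded MNIST image
    target = np.zeros(K)
    target[rng.integers(K)] = 1.0           # placeholder one-hot label
    weight1, weight2 = train_one_image(pixels, target, weight1, weight2)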
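For completeness, the generalized counts of subsections A and B can be collected into two small helpers. These only restate the expressions given above for an arbitrary M x N x K x L network; the concrete 784 x 32 x 10 figures quoted in the text are the paper's own numbers.

def clock_cycles(M, N, K):
    forward = M + N + K                 # cycles for forward propagation
    backward = K + (K + N) + (N + M)    # cycles for back propagation
    return forward + backward           # total cycles per training image

def computational_units(N, K, L):
    # Assumes N is greater than K and L, as stated in subsection B.
    return {
        "MAC units": N + K + L,
        "multipliers": N + K + L,
        "LUTs": N + K + L,
        "subtractors": L,
        "adders with N operands": 1,
    }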
Fig. 4: ANN Architecture for 784 x 32 x 10 network for training and testing of MNIST Dataset
Figure 5 shows the simulation results on Xilinx ISE 14.7 for the training and testing of the MNIST dataset for 100 images.
Fig. 5: Simulation results on Xilinx ISE 14.7 for training and testing of MNIST dataset for 100 images
The results of the timing analysis in hardware and software are tabulated in Table II, which shows that the hardware implementation is roughly 10 times faster than the software-based implementation. The speedup achieved is 4/0.4403 = 9.08, or approximately 10. The proposed hardware-based implementation is therefore approximately 10 times faster than the software-based implementation while sacrificing some accuracy. However, the results obtained show that the hardware-based implementation is a viable solution for applications where fast processing times are essential.
Table III and Table IV present a comparison between the proposed design and existing implementations in terms of speed, resources, and other parameters. The results indicate that the proposed design outperforms the existing designs in terms of speed, flexibility, power, and resource utilization.
Table III: Comparison of the proposed design with High Level Synthesis based architectures from the literature
Furthermore, it is worth noting that the proposed design is more efficient than the existing design, which only implements the forward propagation part in hardware, as shown in Table V.
Table IV: Comparison of the proposed design with RTL-designed architectures from the literature