A parallel camera image signal processor for SIMD architecture
EURASIP Journal on Image and Video Processing volume 2016, Article number: 29 (2016)
Abstract
An image signal processor (ISP) for a camera image sensor consists of many complicated functions; in this paper, a full chain of ISP functions for smart devices is presented. Each function in the proposed ISP full chain is designed to handle high-quality images. Every function in the chain is fully converted to fixed-point arithmetic, and no special mathematical functions are used, which eases porting to a Samsung Reconfigurable Processor (SRP). Several parallelizing optimization techniques are applied to the proposed ISP full chain for real-time operation on a given 600-MHz reconfigurable processor. To verify the performance of the proposed ISP full chain, a series of tests was performed, and all of the measured values satisfy the quality and performance requirements.
Introduction
Image sensors are used in numerous types of image acquisition devices such as digital cameras, camcorders, and CCTV cameras. Recently, their application region has broadened to include smart devices, and the acquired images are not merely for storage but also for interaction between a human and a computer. To satisfy the many goals of image sensors, the role of image enhancement is more important than ever before.
An image signal processor (ISP) is one of the non-optical devices that enhance the image quality of captured raw images; it consists of several image processing algorithms including demosaicing, denoising, and white balancing, as well as other image enhancement algorithms. The latest ISP algorithms, which include iterations with adaptive selections according to the image characteristics, produce excellent image quality. This high image quality costs a vast amount of calculation, however, and also requires complicated adaptive routines that cannot be executed in parallel.
An ISP can be implemented on dedicated hardware, a general-purpose processor, or a parallel-computing processor. A dedicated hardware implementation achieves high image quality and processing performance at the expense of scalability and flexibility. An implementation on a general-purpose processor can accommodate the complicated algorithms needed for high image quality while remaining scalable and flexible; however, its implementation cost is high because of the large amount of computation, and a high-performance platform such as a desktop PC is necessary. A parallel-computing processor combines high processing performance and low power consumption with the scalability and flexibility of a software implementation. The implementation of an ISP algorithm on a parallel-computing processor, however, requires further optimization to utilize multiple processing elements in parallel. Because of the adaptivity of the ISP algorithm, the conventional parallel ISP optimization methodology first divides the algorithm into data processing parts and control processing parts and then operates them in parallel. A Very Long Instruction Word (VLIW) architecture can therefore be an easy choice for ISP implementation, even though a Single Instruction Multiple Data (SIMD) architecture can exploit a greater extent of parallelism.
The ISP full chain that is suitable for parallel processing is proposed in this paper, and the chain is implemented through an optimization process for SIMD processor architecture to achieve both a high image quality and performance goals. The proposed ISP full chain is shown in Fig. 1.
In Fig. 1, GWA is Gray World Assumption, AHD is Adaptive Homogeneity-Directed Demosaicing, BF is Bilateral Filter, AC is Auto Contrast, and LTI is Luminance Transient Improvement.
All of the algorithms in the proposed ISP chain process high-quality images without iterations so that the execution time stays within the real-time budget [1]. While the basic idea of each algorithm is maintained, its operations have been simplified for easy parallelization on the SIMD architecture; in addition, heavy memory accesses and excessive computational overheads are reduced by limiting the operational ranges. Each complicated special operation is replaced by a simple operation that performs a similar function, and the result was verified by experiments.
The proposed parallel ISP algorithm is targeted to run on the Samsung Reconfigurable Processor (SRP) [2–7], which can be configured as an SIMD processor. Numerous high-quality image processing algorithms form the basis of each of the functional components of the proposed ISP full chain [8–30]. By increasing the homogeneity of the parallel operations in the ISP algorithms, the proposed ISP algorithm can take advantage of the parallel performance of an SIMD processor while maintaining an image quality that can pass the commercial image quality test of Skype [31]. The proposed ISP can handle full HD video (1920 × 1080, 30 frames per second) on a 600-MHz SRP that is suitable for smart devices.
This paper comprises the following: Section 2 describes the existing research; Section 3 describes the implementation of the proposed ISP full chain; Section 4 describes the performance verification process and the results of the proposed ISP full chain; and the conclusion is presented in Section 5.
Background research
Algorithms of the ISP full chain
The functions of the ISP full chain mainly support the recovery of missing pixels, noise reduction, and image enhancement. The proposed ISP full chain consists of white balancing, demosaicing, color correction, color space conversion, denoising, detail enhancement, and gamma correction.
The color images that enter through an image sensor can show colors that differ from those seen by the naked eye; the White Balance (WB) process corrects this. The WB algorithm GWA [8, 9] assumes that the average of the image is gray; similarly, the white-patch Retinex (WR) algorithm [10] assumes that the maximum-intensity pixel is white. Since these assumptions can be statistically false, Iterative White Balancing (IWB) [11] iteratively refines the white pixels, while illuminant voting [12] checks the lighting conditions. The GWA is chosen for the proposed ISP because its computation is relatively structured compared with the other algorithms and therefore lends itself well to parallelization; it is expressed as follows:
where C represents one of R, G, and B and C _{WB} represents the color value after white balancing.
After the WB process, demosaicing produces full RGB channels by interpolating the color pixels that are missing from the images captured by the image sensor. Many algorithms exist, including heuristic methods, directional interpolations, frequency-domain approaches, wavelet-based methods, and reconstruction approaches [13–16]; in this study, Adaptive Homogeneity-Directed Demosaicing (AHD) [15], a type of directional interpolation method that is commonly used for digital still cameras, was modified and used. Other algorithms such as wavelet-based methods [16] yield a higher image quality, but they are not suitable for real-time implementation on the reconfigurable processor used in this study because of their huge amounts of calculation and iteration. The rough flowchart of the AHD algorithm is shown in Fig. 2.
The directional interpolation of the AHD performs interpolation in the direction of the strongest edge that flows either vertically or horizontally. Finding the direction of the edge depends on the homogeneity of the neighboring pixels that will also be generated. The homogeneity map is defined by Eq. (2), as follows:
where B is a set of the δ distance from (x, y) ∈ X; X is a set of 2D pixel positions; B is defined by Eq. (3); L _{ f } and C _{ f } are in the neighborhood that is established by the distance of the luminance and color in the CIELab color space and are defined by Eqs. (4) and (5), respectively; E is a set of tolerance values and δ, ε _{ L }, ε _{ C } ∈ E; and d _{ L } and d _{ C } are distance functions, where luminance and the ab plane in the CIELab color space are used. A detailed implementation of AHD is introduced in Hirakawa and Parks [15].
Inevitably, the acquired images contain a variety of noise due to the characteristics of the sensor and converter circuits that are used, especially in the low light of an indoor environment. To remove this noise effectively, highly adaptive noise reduction methods such as the Bilateral Filter (BF) [18–21] or a 3D noise reduction filter [22, 23] can be used. The BF, proposed by Aurich and Weule [20] and improved by Tomasi and Manduchi [21], is a nonlinear adaptive low-pass filter whose weighting factors vary according to the distance and the intensity of the neighboring pixels. Equation (6) shows the BF:
where the normalization term W(x _{ p }, y _{ p }) is defined in Eq. (7):
In Eqs. (6) and (7), (x _{ p }, y _{ p }) is the location of the center pixel; (x _{ q }, y _{ q }) is the location of the neighboring pixel; I(x _{ p }, y _{ p }) and I(x _{ q }, y _{ q }) represent the intensities of the corresponding pixels; G _{ S } is the Gaussian function for the spatial domain; and G _{ I } is the Gaussian function for the intensity domain. The proposed ISP uses a modified BF that can flatten the noise area while preserving the edge information.
For an improved subjective image quality, it is necessary to enhance the contrast and edge information. Auto level, AC, and histogram equalization are examples of methods that can improve the image contrast [24]. The proposed ISP full chain includes the AC function, which causes relatively little color distortion and requires less complex operations; in addition, Luminance Transient Improvement (LTI) and Chrominance Transient Improvement (CTI) are applied to enhance the edges of the luminance and chrominance, respectively [25, 26]. For the LTI and CTI implementation, the difference of Gaussian method [27] is used because of its relatively simple operations and excellent edge extraction performance. The difference of Gaussian method is represented by Eq. (8):
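For reference, the standard bilateral filter can be written with the symbols defined above as in the following sketch. The output symbol \hat{I} and the explicit norm notation are notational assumptions made here; the typeset Eqs. (6) and (7) may differ in detail.

```latex
\hat{I}(x_p, y_p) = \frac{1}{W(x_p, y_p)}
  \sum_{(x_q, y_q) \in S}
  G_S\!\left( \lVert (x_p, y_p) - (x_q, y_q) \rVert \right)
  G_I\!\left( \lvert I(x_p, y_p) - I(x_q, y_q) \rvert \right)
  I(x_q, y_q)

W(x_p, y_p) =
  \sum_{(x_q, y_q) \in S}
  G_S\!\left( \lVert (x_p, y_p) - (x_q, y_q) \rVert \right)
  G_I\!\left( \lvert I(x_p, y_p) - I(x_q, y_q) \rvert \right)
```

Here S denotes the spatial neighborhood of the center pixel, so W simply normalizes the weights to sum to one.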
where O is the enhanced signal, I is the input luminance signal, g _{1} and g _{2} are two Gaussian filters with the variances σ _{1} and σ _{2}, the symbol “∗” is the 2D convolution operator, x is the row number, and y is the column number.
The color correction function adjusts the overall color according to the desired color temperature. In the proposed ISP full chain, color correction is combined with color conversion to reduce redundant memory accesses. The applied color correction matrix is shown in Eq. (9), as follows:
where C _{rr} through to C _{bb} are the correcting values that will be multiplied by the RGB channels and R _{cc}, G _{cc}, and B _{cc} are the colorcorrected values of the color channels.
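Written out with the symbols defined above, a 3 × 3 color correction of this kind has the following general shape. The ordering of the subscripts on the off-diagonal coefficients (for example, C_{rg} as the weight of G in the corrected R) is an assumption made for illustration.

```latex
\begin{bmatrix} R_{cc} \\ G_{cc} \\ B_{cc} \end{bmatrix}
=
\begin{bmatrix}
C_{rr} & C_{rg} & C_{rb} \\
C_{gr} & C_{gg} & C_{gb} \\
C_{br} & C_{bg} & C_{bb}
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
```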
While the acquired images are processed in the ISP full chain, several different color spaces are used. Color conversion is a signal-processing technique that transforms the color representation coordinates into another coordinate system in which the color axes are only weakly correlated, so that the application of signal-processing functions incurs fewer processing errors [30]. In the proposed ISP, the input signal is initially in the RGB space; it is converted into the YC_{o}C_{g} color space and subjected to luminance-related processes; subsequently, the signal is converted back to the RGB space, color-related processes are applied, and the signal is then sent to the output display.
where Y, C _{ o }, C _{ g }, and R, G, B are the pixel values of the YC_{o}C_{g} color space and the RGB color space.
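The linear YCoCg transform is Y = R/4 + G/2 + B/4, Co = R/2 − B/2, Cg = −R/4 + G/2 − B/4. One integer-only way to make it exactly invertible is the lifting variant usually called YCoCg-R, sketched below. Whether the proposed ISP uses this particular variant is an assumption; the function names are hypothetical, and in this variant Co and Cg span roughly twice the range of the linear form.

```python
def rgb_to_ycocg_r(r, g, b):
    """Forward lifting transform (YCoCg-R); integers in, integers out."""
    co = r - b
    t = b + (co >> 1)
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg


def ycocg_r_to_rgb(y, co, cg):
    """Inverse transform: undo each forward lifting step in reverse order."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b


# Round trip over a few 8-bit pixels, including saturated colors.
pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (128, 128, 128), (17, 201, 94)]
restored = [ycocg_r_to_rgb(*rgb_to_ycocg_r(*p)) for p in pixels]
```

Because every step only adds, subtracts, or shifts, and the inverse subtracts exactly what the forward pass added, the round trip reproduces the input bit for bit.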
Gamma correction (GC) modifies the linearity of the camera input to match the nonlinearity of the human visual system [28, 29]. If GC is not applied to the acquired images, humans cannot differentiate the immense number of bits that represent the information. The GC can be modeled as Eq. (13); in the proposed ISP, the GC is implemented using a polynomial approximation:
where I ' is the output image, I is the input image, and γ is the gamma value. A is a constant that equals 1 in the common case.
Implementation platform
The proposed ISP is implemented on an SRP, following the test of the preliminary version of the proposed ISP [1]. The SRP supports the following three operation modes: SIMD, VLIW, and scalar. Since the SRP supports both of the parallel processing modes, SIMD and VLIW, the proposed ISP is accelerated by implementing numerous key operations so that they run in parallel. As the SRP configuration that is used in this study can process 128 bits at a time with 16 functional units, it supports SIMD configurations that process four 32-bit, eight 16-bit, or sixteen 8-bit data at one time. In the VLIW mode, eight of the functional units can be operational at the same time, whereby up to eight operations can be executed in parallel. Since the routing channel of the SRP has independent configurations for the SIMD, VLIW, and scalar modes, the three modes cannot be used in combination; however, the SRP can switch among the three operational modes dynamically while the ISP software runs. While the sequential codes in the complex control sequences of the algorithm run in the scalar mode, the parallel codes of the massive image data processing operations are accelerated in the SIMD mode or the VLIW mode.
Memory accesses of the SRP must be aligned to 128-bit words; therefore, if the data size is not a multiple of a 128-bit word, an additional buffering stage is needed to ensure alignment with the 128-bit words. The SRP also has a single memory port for read and write operations; therefore, memory-intensive jobs such as lookup table operations cannot be parallelized, and they significantly slow down the processing speed.
The VLIW mode of the SRP offers greater programming flexibility because data processing operations and control operations can be executed simultaneously in this mode. The control operations often limit the parallelism, however, because of the dependency among the codes and data; in the SIMD mode, on the other hand, the SRP often suffers from a lack of data that can be processed in parallel. Since this lack of parallelism is inherent to the algorithm, the algorithms in the proposed ISP are modified to supply enough parallelism; therefore, the proposed ISP can mostly run in the SIMD mode for a sufficient computational performance. Figure 3 shows an overview of the SRP architecture.
In Fig. 3, FU is Function Unit, RF is Register Files, VLIW is Very Long Instruction Words, and CGRA is Coarse Grained Reconfigurable Array.
Existing research shows that other algorithms have been ported to the SRP platform, such as ray tracing [4], low-power audio processing [5], and 3D graphics [6]. The proposed ISP full chain is designed to work with SIMD-style parallel processing; however, due to its high parallelism, the proposed design can also be used on platforms with other types of microprocessors such as Intel [31], ARM [32], and the TI Digital Signal Processor (DSP) [33].
Intel processors and ARM processors are based on the superscalar architecture, which executes multiple instructions at the same time. Intel processor platforms perform better than ARM processor platforms because the former contain a variety of hardware accelerators for multimedia processing (MMX, SSE, etc.); ARM processor platforms, on the other hand, consume less power than Intel processor platforms, making them suitable for mobile applications. TI DSP platforms have VLIW architectures, whereby multiple signal-processing operations and control operations can be executed in parallel.
Algorithm porting on SRP
A number of optimization techniques were used to improve the computational performance of the ISP full chain on the SRP while the image quality is maintained. The SRP that is used in this study provides several commands for the efficient use of SIMD arithmetic data. The SIMD commands process 128-bit data as eight 16-bit data. They comprise ADD, SHF (shift), CLIP, MUL (multiply) combined with ADD, and MUL combined with SHF. Since there is no SIMD command that returns the result of a comparison, the SUB and CLIP commands were used in combination to obtain the comparison result. The SIMD commands were heavily used in the proposed ISP functions for high performance.
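One plausible way to obtain a comparison result from subtraction and clipping alone is sketched below, with Python lists standing in for SIMD lanes. The exact instruction sequence used on the SRP is not given in the text, so the helper names and the clip bounds are assumptions, and 16-bit overflow is ignored.

```python
def simd_sub(a, b):
    """Lane-wise subtraction, emulating one SIMD SUB over a list of lanes."""
    return [x - y for x, y in zip(a, b)]


def simd_clip(v, lo, hi):
    """Lane-wise clamp to [lo, hi], emulating one SIMD CLIP."""
    return [min(max(x, lo), hi) for x in v]


def greater_than_mask(a, b):
    """Return 1 in lanes where a > b and 0 elsewhere, using only SUB and CLIP."""
    return simd_clip(simd_sub(a, b), 0, 1)


def select_max(a, b):
    """Branch-free lane-wise maximum: b + mask * (a - b)."""
    diff = simd_sub(a, b)
    mask = simd_clip(diff, 0, 1)
    return [y + m * d for y, m, d in zip(b, mask, diff)]


a = [5, 3, 7, 0]
b = [3, 3, 9, -2]
mask = greater_than_mask(a, b)
largest = select_max(a, b)
```

For integer lanes, a − b is at least 1 exactly when a > b, so clipping the difference to [0, 1] yields a 0/1 mask that can then drive a multiply-and-add selection.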
Module optimization for the proposed ISP
WB
In the proposed ISP, the WB uses the GWA algorithm [9]. The GWA algorithm corrects the colors of an image, assuming that the average color of each RGB channel is gray. Using Eqs. (14) to (16), we calculated the GWA as follows:
where R _{gain}, G _{gain}, and B _{gain} are the color gain values for each of the channels and \( \overline{R} \), \( \overline{G} \), and \( \overline{B} \) are the averages of the pixel values of the corresponding color channels.
In the input Bayer pattern, the number of G pixels is twice that of the R or B pixels. As the WB process requires the average of the entire image, it is possible to use only half of the G pixels without incurring a significant error; therefore, when the GWA was applied, only half of the G pixels were used so that the amount of calculation for G equaled that for R and B. Since the sum over all of the pixels is too large to fit into an integer register, an addition that keeps only the proper significant figures was used to limit the bit number of the sum. If the integer registers that are used for the calculations are too small, too few significant digits remain and the errors become larger; conversely, if the integer register size is too large, the processor limitation makes parallelization difficult. For this reason, the sum register size was limited to 32 bits, and the temporary variables can be stored in 64-bit registers. Since the SRP does not support division, shift operations are used for the WB result. The division by 3 in Eqs. (14) to (16) is approximated by 3/8, which is performed as a multiplication by 3 followed by a right shift by three bits.
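The multiply-and-shift step can be sketched as follows, assuming each gain is the gray level (the sum of the three channel averages divided by 3) over that channel's average. The function names, the 8 fractional bits, and the scalar division used for the per-frame gains are illustrative assumptions, not the SRP implementation.

```python
def approx_third(total):
    """Approximate total / 3 as total * 3 / 8 (multiply by 3, shift right by 3)."""
    return (total * 3) >> 3


def gray_world_gains(mean_r, mean_g, mean_b, frac_bits=8):
    """Return fixed-point gains (scaled by 2**frac_bits) for R, G, and B."""
    gray = approx_third(mean_r + mean_g + mean_b)
    # One scalar division per channel and frame; hardware without a divider
    # would need a reciprocal approximation here (not shown).
    return tuple((gray << frac_bits) // m for m in (mean_r, mean_g, mean_b))


def apply_gain(value, gain, frac_bits=8):
    """Scale one pixel value by a fixed-point gain."""
    return (value * gain) >> frac_bits


means = (50, 100, 200)            # hypothetical channel averages
gains = gray_world_gains(*means)
balanced = [apply_gain(m, g) for m, g in zip(means, gains)]
```

Because 3/8 overestimates 1/3 by 12.5 %, all three gains are scaled by the same factor; this changes the overall brightness but not the balance between the channels, which is why the three balanced averages in the example coincide.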
Modified AHD
As in Fig. 2, after the WB process is performed on the Bayer pattern of the image sensor, a modified version of the original AHD [15] is used as the demosaicing algorithm. The AHD consists of the following three steps: directed interpolation, homogeneity-directed map creation, and iterative noise filtering. The method for finding the direction of the edge depends on the location and color of the pixel that is to be generated. The width of the variables is 16 bits, including three bits for the fractional part. An operational example of the proposed modified AHD is explained as follows.
In Fig. 4, R, G, and B are the red, green, and blue pixels, respectively, and the number is the pixel position. Figure 4 shows GBRG, the Bayer pattern of the image sensor that was used for the proposed ISP implementation; taking G44 in the middle as the reference, the GBRG pattern consists of G44, B45, R54, and G55, and the numbering starts from the top left. Equations (17) and (18) represent the horizontal interpolation of the G and R pixels on the B channel where the input B pixel exists. In Eqs. (17) to (27), all of the parameters are the corresponding pixel values of the locations in Fig. 4.
Equations (19) and (20) represent the horizontal interpolation of the G and B pixels on the R channel where the input R pixel exists, as follows:
Equations (21) to (24) represent the horizontal interpolation of the R and B pixels on the G channel where the input G pixel exists, as follows:
The original AHD repeats the calculation vertically, and it also includes an additional direction-selection process after the generation of the homogeneity map, which requires the CIELab color conversion and the epsilon parameter. The modified AHD selects the direction immediately after the interpolation of the G channel, and then the R and B channels are interpolated only once; by doing this, the process of selecting a direction-based RGB pixel value is removed to reduce the amount of calculation. The G channel interpolation equation is also modified, as shown in Eqs. (25) to (27):
where H is the horizontal direction weight, V is the vertical direction weight, and abs() is the absolute value function. In Eqs. (25) and (26), together with Eq. (27), the G channel interpolation depends on the results that are obtained by the horizontal and vertical calculations of the three-tap filters.
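The idea of choosing an interpolation direction from two absolute-difference weights can be illustrated with a generic gradient-based selection. This is not a reproduction of Eqs. (25) to (27): the definitions of H and V as simple neighbor differences, the tie-breaking rule, and the function name are all assumptions made for illustration.

```python
def interpolate_g(left, right, up, down):
    """Estimate a missing G value from its four G neighbors.

    The direction with the smaller absolute difference is treated as the
    smoother one, so interpolation runs along it rather than across an edge.
    """
    h = abs(left - right)   # horizontal weight (assumed definition)
    v = abs(up - down)      # vertical weight (assumed definition)
    if h < v:
        return (left + right) >> 1
    if v < h:
        return (up + down) >> 1
    return (left + right + up + down) >> 2   # no preferred direction


along_row = interpolate_g(10, 12, 100, 0)     # smooth horizontally
along_col = interpolate_g(100, 0, 10, 12)     # smooth vertically
flat = interpolate_g(10, 10, 20, 20)          # tie: average all four
```

Only additions, absolute values, comparisons, and shifts are needed, which matches the kind of operations the text says the modified AHD is restricted to.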
An iterative noise filtering is used in the original AHD. The iterative noise filtering is removed for the reduction of operational loads, however, and it is also a redundant operation because it is performed by the modified BF in the next stage.
To minimize the data load for the Bayer pattern images in the modified AHD, the memory area is designated a size that is one column larger than the original image size and the images are read only once. By manipulating the pointer to the start position, boundary processing is not needed at the time of the RGB channel interpolation, and an ordered data loading technique is applied to the vertical filter that is used for the RGB channel interpolation.
When data are loaded from the memory and fed into a filter, some of the data loads may overlap because of the convolution operation of the filter. Pseudo codes 1 and 2 show the data load for horizontal loading and vertical loading, respectively, for the 1 × 3 filter. As shown in pseudo code 1, the buffer size is 128 bits; that is, it holds eight 16-bit data. Once the filtering of an image line is complete, the filtering of the next image line loads a new image line (at code line 8) and reloads the two lines that had already been loaded while the previous image line was processed (code lines 6 and 7 repeat the loads of code lines 2 and 3, respectively). To prevent such overlapping data loads, the data should be loaded by row unit, and then only the required data should be read by referring to the existing buffers, as shown in pseudo code 2, where the overlapped data load of pseudo code 1 does not exist. This technique, as shown below, is used for the modified AHD, modified BF, modified LTI, and other modules in the proposed ISP.

Pseudo code 1:
line i:
1: Buf1[0:127] = Image[i - 1, j + 0:127]
2: Buf2[0:127] = Image[i, j + 0:127]
3: Buf3[0:127] = Image[i + 1, j + 0:127]
4: OUTPUT[0:127] = C1*Buf1[0:127] + C2*Buf2[0:127] + C3*Buf3[0:127]
5: j = j + 128; goto 1

line i + 1:
6: Buf1[0:127] = Image[i, j + 0:127]
7: Buf2[0:127] = Image[i + 1, j + 0:127]
8: Buf3[0:127] = Image[i + 2, j + 0:127]
9: OUTPUT[0:127] = C1*Buf1[0:127] + C2*Buf2[0:127] + C3*Buf3[0:127]
10: j = j + 128; goto 6



Pseudo code 2:
Before line i:
1: Buf1[0:127] = Image[i - 1, j + 0:127]
2: Buf2[0:127] = Image[i, j + 0:127]

line i:
3: Buf3[0:127] = Image[i + 1, j + 0:127]
4: OUTPUT[0:127] = C1*Buf1[0:127] + C2*Buf2[0:127] + C3*Buf3[0:127]
5: temp = C3; C3 = C2; C2 = C1; C1 = temp

line i + 1:
6: Buf1[0:127] = Image[i + 2, j + 0:127] (overwrites the oldest row, i - 1)
7: OUTPUT[0:127] = C1*Buf1[0:127] + C2*Buf2[0:127] + C3*Buf3[0:127]
8: temp = C3; C3 = C2; C2 = C1; C1 = temp

At the bottom of the image:
9: j = j + 128


where C1, C2, and C3 are the filter coefficients in pseudo code 1 and pseudo code 2.
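The difference between the two loading schemes can be sketched in a few lines, with image rows standing in for 128-bit buffers. The function names and the load counters are hypothetical and exist only to make the saving visible; here the buffer slots are indexed cyclically instead of rotating the coefficients, which has the same effect.

```python
def vertical_filter_naive(image, coeffs):
    """Reload all three rows for every output row (the overlapping scheme)."""
    c1, c2, c3 = coeffs
    out, loads = [], 0
    for i in range(1, len(image) - 1):
        top, mid, bot = image[i - 1], image[i], image[i + 1]
        loads += 3
        out.append([c1 * a + c2 * b + c3 * c for a, b, c in zip(top, mid, bot)])
    return out, loads


def vertical_filter_rolling(image, coeffs):
    """Load each row once; row r lives in buffer slot r % 3."""
    c1, c2, c3 = coeffs
    bufs = [image[0], image[1], None]   # two rows preloaded before the loop
    out, loads = [], 2
    for i in range(1, len(image) - 1):
        slot = (i + 1) % 3
        bufs[slot] = image[i + 1]       # overwrite the oldest row
        loads += 1
        top, mid, bot = bufs[(i - 1) % 3], bufs[i % 3], bufs[slot]
        out.append([c1 * a + c2 * b + c3 * c for a, b, c in zip(top, mid, bot)])
    return out, loads


image = [[r * 10 + c for c in range(4)] for r in range(5)]
naive_out, naive_loads = vertical_filter_naive(image, (1, 2, 1))
rolling_out, rolling_loads = vertical_filter_rolling(image, (1, 2, 1))
```

For an image with N rows, the naive scheme issues 3(N − 2) row loads and the rolling scheme N, so the saving approaches a factor of three for tall images.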
The PSNR values for the modified AHD algorithm are compared with those of the conventional AHD in Table 1. Kodak lossless true color images were modified to form the GBRG Bayer pattern images that are used for the PSNR comparison. The PSNR differences vary between −0.22 and 1.76 dB, with an average difference of 0.48 dB, while the computational load is significantly reduced.
Color correction and color space conversion
After demosaicing, the color correction block finds the color features and repairs the color artifacts; the color correction is performed with the color correction matrix shown in Eq. (9). The color correction equation can be calculated in conjunction with the subsequent color space conversion. In the proposed ISP, the YC_{o}C_{g} color space is used because it has a lower correlation among the color channels compared with other color spaces and because its conversion uses integer operations only, without any information loss. Because the color correction equation can be combined with the equation of the YC_{o}C_{g} color space conversion, the intermediate step of storing the color correction result can be removed, thereby reducing the memory access cycles. The equations for performing the combined color correction and the YC_{o}C_{g} color space conversion are shown in Eqs. (28) to (31):
where Y, C _{ o }, C _{ g }, and R, G, B are the pixel values in the YC_{o}C_{g} color space and RGB color space, respectively. A color control function is also combined in the YC_{o}C_{g} color space conversion to control the color saturation and color offset. A coefficient integerization technique was used for the color correction.
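Folding the two linear steps into one matrix can be sketched as follows. The color correction coefficients are hypothetical Q8 values chosen so that each row sums to 256, and the YCoCg matrix is the linear form scaled by 4 to keep every entry an integer; neither is taken from Eqs. (28) to (31).

```python
def matmul3(a, b):
    """Product of two 3x3 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply3(m, v):
    """Apply a 3x3 matrix to a 3-element vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


# Linear YCoCg forward matrix scaled by 4 (rows: Y, Co, Cg).
YCOCG_X4 = [[1, 2, 1], [2, 0, -2], [-1, 2, -1]]

# Hypothetical color correction matrix in Q8 fixed point.
CCM_Q8 = [[300, -30, -14], [-20, 290, -14], [-10, -40, 306]]

# Precomputed once; each pixel then needs a single matrix application.
COMBINED = matmul3(YCOCG_X4, CCM_Q8)

rgb = [200, 120, 40]
two_step = apply3(YCOCG_X4, apply3(CCM_Q8, rgb))
one_step = apply3(COMBINED, rgb)

# Remove the combined scale of 2**8 * 4 = 2**10 to get Y, Co, Cg.
gray_ycocg = [v >> 10 for v in apply3(COMBINED, [128, 128, 128])]
```

The two results agree exactly because integer matrix multiplication is associative; the corrected RGB values never have to be written back to memory.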
Auto contrast
A linear stretch method is used in the AC, and the linear scale factors in the AC function were calculated in the YC_{o}C_{g} color space. Equations (32) and (33) were applied to the AC that is used in this study:
where R _{ S } is the stretch ratio; B _{max} is the maximum value of the bit depth; Y _{max} and Y _{min} are the maximum and minimum values of the Y channel, respectively; Y _{in} is the input image; and Y _{out} is the image after AC. For the calculation of the Y _{max} and Y _{min} values, a technique that separates and rearranges the algorithm is used to reduce the memory accesses. First, the Y _{max} and Y _{min} values are calculated from the Y value that is obtained through the YC_{o}C_{g} color conversion, and the calculated results are then used for the AC calculation, which is included in the modified BF function. The AC is optimized to use shift operations so that the coefficient integerization is not broken.
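A fixed-point linear stretch of this kind can be sketched as below, assuming the usual form R_S = B_max / (Y_max − Y_min) and Y_out = (Y_in − Y_min) · R_S. That form, the 8 fractional bits, and the function name are assumptions; the single scalar division per frame stands in for whatever reciprocal scheme the hardware uses.

```python
def auto_contrast(y_values, bit_depth=8, frac_bits=8):
    """Stretch luminance values to the full range using integer arithmetic."""
    b_max = (1 << bit_depth) - 1
    y_min, y_max = min(y_values), max(y_values)
    if y_max == y_min:
        return list(y_values)          # flat image: nothing to stretch
    # Stretch ratio in fixed point, computed once per frame.
    ratio = (b_max << frac_bits) // (y_max - y_min)
    # Per-pixel work is one subtract, one multiply, and one shift.
    return [((y - y_min) * ratio) >> frac_bits for y in y_values]


stretched = auto_contrast([50, 100, 150])
```

Truncating the ratio makes the brightest output land one code below full scale in this example, which is the usual price of avoiding a per-pixel division.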
Modified BF
BF is used as the noise reduction algorithm [18–21]. The original BF comprises the following two Gaussian filters: one for the distance weight between pixel locations and the other for the difference weight between pixel intensities. To simplify these two Gaussian filters, the Gaussian functions are replaced by fixed-point binary threshold functions in the proposed modified BF. The threshold values are determined by precalculating the Gaussian filter coefficients for the pixel locations and pixel intensities.
In the proposed ISP algorithm, the BF is simplified to reduce the amount of calculation. Since the Gaussian function requires special math hardware, G _{ S } and G _{ I } are replaced by the binarization functions B _{ S } and B _{ I }. The size of the spatial domain of G _{ S } in the proposed ISP is 7 × 7. The output of B _{ S } for the same domain size is 1 for a 3 × 3 area and 0 elsewhere; therefore, the domain S of 7 × 7 is replaced by the new domain S ' of 3 × 3. B _{ I }, the binarization of G _{ I }, is represented by the following:
where I _{Th} is the threshold value of the pixel value difference. The resulting modified BF is Eq. (35):
where the new normalization term W ' _{ p } is the following:
To further reduce the calculation complexity, the 3 × 3 filter of the domain S ' was replaced by a separable filter that is composed of two 1D filters of the sizes 3 × 1 and 1 × 3; by using this separable filter, the computational complexity of the proposed BF becomes O(n), instead of the O(n ^{2}) of the original BF [34].
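A separable, thresholded filter of this kind can be sketched as a 3-tap pass applied along rows and then along columns. The threshold rule (a neighbor counts when its intensity differs from the center by at most I_Th), the border handling, and the integer division by the neighbor count are assumptions for illustration; a separable pass approximates, but is not identical to, the 3 × 3 thresholded filter.

```python
def bf_pass_1d(line, i_th):
    """3-tap pass: average the neighbors whose intensity is within i_th of the center."""
    out = []
    n = len(line)
    for p in range(n):
        total, weight = 0, 0
        for q in (p - 1, p, p + 1):
            if 0 <= q < n and abs(line[q] - line[p]) <= i_th:
                total += line[q]     # binary weight: the pixel counts fully
                weight += 1
        out.append(total // weight)  # weight >= 1 because the center always counts
    return out


def modified_bf(image, i_th):
    """Apply the 3-tap pass horizontally, then vertically."""
    rows = [bf_pass_1d(r, i_th) for r in image]
    cols = [bf_pass_1d(list(c), i_th) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


edge = bf_pass_1d([10, 10, 10, 100, 100, 100], 20)   # step edge is preserved
noisy = bf_pass_1d([10, 16, 10], 20)                  # small deviation is smoothed
image_out = modified_bf([[10, 10, 100], [10, 10, 100]], 20)
```

Pixels across the step differ by more than the threshold, so they are excluded from each other's average and the edge survives, while small deviations within a flat area are averaged away.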
When the algorithm is implemented with a 2D filter, the amount of memory access and computation on the SRP increases quadratically with the filter size. Instead of the 2D filters of the original BF, a separable filter is applied in the proposed modified BF. By making the 2D filter separable, the computational load of 2D filtering is reduced to twice that of 1D filtering. Figure 5 compares the filtering operation of a 2D filter with that of a 1D filter for the SRP. Due to the SRP structure, all of the data must be stored in buffers before a filter is applied; so, when a 3 × 3 filter mask is used, fifteen 128-bit registers are needed before the operation can start. Alternatively, when a 1D filter is used, the operation can be performed with five 128-bit registers for a horizontal filter and three 128-bit registers for a vertical filter; therefore, the use of a separable filter also reduces the amount of memory access.
In Fig. 5, a square box represents a single 16-bit pixel, and a buffer holds eight pixels. In addition, the filter size has been changed from 7 × 7 to 3 × 1 and 1 × 3, and a vertical data loading technique is applied to the 1 × 3 vertical filter.
Detail enhancement
For detail enhancement, an LTI based on the difference of Gaussian [27] is used. The Gaussian mask sizes in the LTI are 3 × 3 and 5 × 5, with precalculated coefficients. Since CTI rarely affects the image quality, a simple 1 × 3 Laplacian sharpening filter is used for the CTI. Separable filters are implemented for the LTI, and the filter size is adjusted. As the filter that is used here is also a vertical filter, the vertical data loading technique was applied.
Gamma correction
For GC, a lookup table method or a piecewise linear interpolation method [29] is commonly used. In the lookup table method, the input data is used as the index into the table; in the piecewise linear interpolation method, the input range and the linear interpolation parameters are looked up from a table. In this study, instead of a lookup table, which is difficult to parallelize because of its large volume of irregular memory accesses, a quadratic approximation of the GC equation that exploits the 128-bit data processing of the SRP was used. Equation (37) is the equation that is used for GC:
where k _{1}, k _{2}, and k _{3} are the GC coefficients, x is the pixel value of the RGB channel, and y is the GC result value. The parameters are chosen to minimize the squared error over most of the central region. Since GC is performed for all three of the RGB channels, the algorithm was rearranged to use the results of the YC_{o}C_{g}-to-RGB color space conversion.
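A least-squares quadratic fit of a gamma curve can be sketched in plain Python as below, assuming the quadratic has the form y = k1·x² + k2·x + k3. That form, the gamma value of 1/2.2, and the choice of codes 32 to 223 as the central region are illustrative assumptions, and the function name is hypothetical.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = k1*x^2 + k2*x + k3 via the 3x3 normal equations."""
    s = [sum(x ** p for x in xs) for p in range(5)]               # sums of x^0 .. x^4
    t = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]
    # Augmented normal-equation matrix for the unknowns (k1, k2, k3).
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    for col in range(3):                                          # Gauss-Jordan elimination
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(3):
            if r != col:
                f = m[r][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return m[0][3], m[1][3], m[2][3]


# Fit the gamma curve over an assumed central region of normalized 8-bit codes.
xs = [i / 255 for i in range(32, 224)]
ys = [x ** (1 / 2.2) for x in xs]
k1, k2, k3 = fit_quadratic(xs, ys)

# Sanity check on data that is exactly quadratic.
check_xs = [0.0, 0.25, 0.5, 0.75, 1.0]
check = fit_quadratic(check_xs, [2 * x * x + 3 * x + 1 for x in check_xs])
```

Once the three coefficients are fixed, each pixel needs only two multiplications and two additions, which map onto SIMD multiply and add operations without any table access.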
Experiment results
To verify the performance of the proposed ISP full chain, the quality of the result images should first pass a commercially available image quality test such as that of Skype [35]. The experiments were conducted using a CMOS image sensor whose specification is shown in Table 2. Figures 6, 7, and 8 show parts of the test patterns.
Figure 6 is the image quality resolution test pattern, which is used to measure the clearness of luminance images. Figure 7 is for the evaluation of color performance and Fig. 8 is for the verification of texture acuity. Other patterns for the measurement of aspects such as exposure error, gamma, SNR, and dynamic range exist.
Table 3 shows the results of the image quality for the proposed ISP full chain that was implemented on the SRP; as shown in Table 3, all of the measured values meet the requirements of the test. Since the entire proposed ISP chain has been designed only with fixedpoint addition and multiplication, the proposed ISP chain can be easily ported onto any other microprocessor; furthermore, even when the characteristics of a CMOS image sensor change, it is still possible to meet the image quality evaluation standards by simply adjusting the coefficients that are used for the ISP full chain.
The performance goal of the proposed ISP is the processing of full HD image sequences (1920 × 1080, 30 frames per second) with a 600MHz SRP.
Table 4 shows the number of clock cycles taken by the modules in the proposed ISP full chain. The number of cycles for sequential processing is the cycle count without any SIMD operations, and the number of cycles for parallel processing is the cycle count when the SIMD operations of the eight processing elements are used. Dividing the total sequential cycle count by the total parallel cycle count gives a parallelization speedup of 4.9. The degree of parallelism can be found by using Amdahl’s law, Eq. (38), as follows:
$$\text{Speedup} = \frac{T_{s\_old} + T_{p\_old}}{T_{s\_new} + T_{p\_new}} \qquad (38)$$
where $T_{s\_old}$ is the time taken by the sequential operations that are not affected by parallelization; $T_{p\_old}$ is the time taken by the sequential operations that are affected by parallelization; $T_{s\_new}$ is the time taken by the sequential operations that are not affected by parallelization after improvement; and $T_{p\_new}$ is the time taken by the parallel operations after improvement.
Since the sequential parts are not affected by parallelization, their processing time does not change after improvement, so $T_{s\_old} = T_{s\_new} = T_s$. In the proposed ISP, the parallelization is performed by the SIMD with eight processing elements, so $T_{p\_new} = T_{p\_old}/8$. If $T_{p\_new}$ is normalized to 1, Eq. (38) becomes Eq. (39):
$$\text{Speedup} = \frac{T_s + 8}{T_s + 1} \qquad (39)$$
Solving Eq. (39) for the sequential time with the measured speedup gives $T_s = T_{s\_old} = 0.81$. Since the total execution time before parallelization is $T_{s\_old} + T_{p\_old} = T_s + 8\,T_{p\_new} = 8.81$, the time that is not affected by parallelization is only 9 % of the total sequential execution time; the remaining 91 % of the total sequential time is parallelized by the eight processing elements.
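The arithmetic in this derivation can be checked with a short sketch. The function names below are hypothetical, and the only assumption beyond the text is the normalization already used there (the parallel time after improvement is 1, so the parallelizable time before improvement equals the number of processing elements). Note that plugging in the rounded speedup of 4.9 yields a sequential time of about 0.79 rather than 0.81; the reported 0.81 corresponds to a speedup of about 4.87, so it was presumably derived from the unrounded cycle totals.

```python
def sequential_time(speedup, n_pe):
    """Solve speedup = (T_s + n_pe) / (T_s + 1) for T_s (T_p_new normalized to 1)."""
    if speedup <= 1 or speedup > n_pe:
        raise ValueError("speedup must lie in (1, n_pe]")
    return (n_pe - speedup) / (speedup - 1)


def speedup_from(t_s, n_pe):
    """Amdahl speedup for a sequential time t_s and n_pe processing elements."""
    return (t_s + n_pe) / (t_s + 1)


def parallel_fraction(t_s, n_pe):
    """Share of the original execution time that is affected by parallelization."""
    return n_pe / (t_s + n_pe)


t_s_rounded = sequential_time(4.9, 8)       # from the rounded speedup
implied_speedup = speedup_from(0.81, 8)     # speedup implied by the reported T_s
fraction = parallel_fraction(0.81, 8)       # parallelized share of the total time

print(f"T_s from speedup 4.9: {t_s_rounded:.3f}")
print(f"speedup implied by T_s = 0.81: {implied_speedup:.3f}")
print(f"parallelized share: {fraction:.1%}")
```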
Since the resolution of the CMOS image sensor that is used in the experiment is larger than the target resolution, the measured cycles are converted to the performance for full HD image sequences as shown in Eq. (40), as follows:
where $C_s$ is the number of cycles per second, CPP is the number of cycles per pixel, Res is the target resolution, $C_f$ is the number of simulation cycles, and TP is the number of test image pixels. Since the input resolution of the test camera is 2624 × 1956, the total number of cycles needed to handle an image of 1920 × 1080 resolution was recalculated; the recalculated SRP simulation result satisfies the real-time operation target for the 600-MHz SRP.
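The conversion can be sketched as below, under the assumption that the cycles per pixel are the simulated cycles divided by the test image pixels and that the required cycles per second are the cycles per pixel times the target resolution times the frame rate. That reading follows from the symbol definitions above but is not a reproduction of Eq. (40). The simulated cycle count in the example is a made-up placeholder, not a value from Table 4, and the function names are hypothetical.

```python
CLOCK_HZ = 600_000_000            # 600-MHz SRP target from the text
TARGET = (1920, 1080, 30)         # full HD at 30 frames per second
TEST_PIXELS = 2624 * 1956         # resolution of the test camera


def cycles_per_pixel(sim_cycles, test_pixels):
    """CPP: simulated cycles for one test frame divided by its pixel count."""
    return sim_cycles / test_pixels


def required_cycles_per_second(cpp, width, height, fps):
    """Cycles per second needed to sustain the target resolution and frame rate."""
    return cpp * width * height * fps


def max_cycles_per_pixel(clock_hz, width, height, fps):
    """Largest CPP that still fits the clock budget at the target format."""
    return clock_hz / (width * height * fps)


budget_cpp = max_cycles_per_pixel(CLOCK_HZ, *TARGET)

# Hypothetical simulator result for one 2624 x 1956 frame (placeholder value).
HYPOTHETICAL_SIM_CYCLES = 40_000_000
cpp = cycles_per_pixel(HYPOTHETICAL_SIM_CYCLES, TEST_PIXELS)
needed = required_cycles_per_second(cpp, *TARGET)

print(f"budget: {budget_cpp:.3f} cycles per pixel")
print(f"example: {cpp:.3f} cycles per pixel, {needed / 1e6:.1f} Mcycles/s needed")
print("meets real-time target:", needed <= CLOCK_HZ)
```

Under these assumptions, any implementation that stays below roughly 9.6 cycles per pixel meets the full HD, 30 frames per second target on a 600-MHz clock, independent of the resolution of the frame used in simulation.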
Table 5 shows the performance of the proposed ISP on other platforms in cycles per pixel. To compare the parallelization performance of the proposed ISP algorithm, widely used commercial processor platforms were used to run the proposed ISP full chain. The test platforms were general-purpose desktop processors of the Intel processor family, a general-purpose mobile processor of the ARM Cortex family, and a signal-processing VLIW processor of the TI DSP family; for the TI platform and the SRP platform, the simulators provided by the manufacturers were used in the experiments. Since each platform runs at a different operating frequency, the cycles per pixel were calculated for the purpose of comparison. The proposed ISP full chain was compiled for a single processor because the communication overhead of multiple threads can degrade the efficiency of parallel operations. All of the platforms feature multiple-issue pipelines and SIMD-style multimedia instructions [36–38], and the optimization options were disabled for comparison purposes because the SRP compiler does not have optimization options.
For faster porting, cycle-accurate simulators for the TI C64x+ and the SRP were used. Commercial TI C64x+ processors operate at between 500 and 1200 MHz; the SRP, the target platform of the proposed algorithm, was designed to run at 600 MHz.
Table 6 shows the cycles per pixel obtained when the compiler optimization options are applied to the proposed ISP full chain. The SRP compiler does not provide optimization options.
For the Intel and ARM platforms, the GCC compiler was used. Option O1 enables branch, register, and tree optimization. Option O2 additionally enables alignment, local, and global optimization, while option O3 enables all of the above-mentioned optimizations of O1 and O2 along with parallelizing optimizations such as loop unrolling and loop vectorization. Although option O3 does not guarantee a speedup over option O2 [39], applying option O3 to the proposed ISP full chain achieves a higher speedup because of its inherently higher degree of parallelism.
The TI compiler for the TI platform also provides optimization options. Option O1 in the TI compiler performs register usage optimization; option O2 performs global optimization, including parallelizing optimizations such as software pipelining, loop optimization, and loop unrolling; and option O3 performs optimizations related to inlining calls to small functions and reordering function declarations. Again, applying option O2 to the proposed ISP full chain achieves a higher speedup than the other options because of the higher degree of parallelism.
Since the SRP can process eight 16-bit operations in parallel with a single SIMD instruction, the SRP outperforms the fastest commercial platform, the i7, by a factor of 3.36 for the fully optimized versions in Table 6. Although the Intel platforms have an issue width of 4 and SIMD instructions that operate on 4 × 16-bit data, the inefficiency of dynamic scheduling and a lower memory bandwidth limit how much of the parallelism of the proposed algorithm they can exploit; even so, the parallelism of the proposed algorithm also works on Intel platforms. With respect to the ARM platform, its issue width is half that of the Intel platforms and its memory bandwidth is even lower, so its cycles-per-pixel value is 4.07 times higher than those of the Intel platforms. The TI platform also has two issue pipelines, with more operation slots for control operations; because the proposed algorithm is designed for data parallelism, however, the performance gain over the ARM platform is marginal.
Conclusions
In this study, a parallel version of the ISP full chain is proposed and implemented on an SRP architecture whose SIMD instructions are eight data elements wide. The proposed ISP full chain is written in the C language for portability, and the image quality was verified with a commercially available test suite. The proposed algorithm was modified to reduce the computational load and to make its parallelism easy to exploit. A variety of optimization techniques were also applied to make the algorithm suitable for a SIMD-style architecture. The experiment results satisfy both the image quality standard and the real-time operation speed for a 600-MHz SRP with full HD image sequences, and the implementation utilizes approximately five out of the eight operation slots in the SIMD instruction of the SRP. The parallelism of the proposed algorithm was also tested in a comparison with other commercial platforms, and the results show that it can be easily exploited.
References
 1.
S Choi, J Cho, Y Tai, S Lee, Implementation of an image signal processor for reconfigurable processor, 2014 IEEE International Conference on Consumer Electronics, (IEEE, Las Vegas, NV, 2014), pp. 141–142
 2.
D Suh, K Kwon, S Kim, S Ryu, J Kim, Design space exploration and implementation of a high performance and low area coarse grained reconfigurable processor, 2012 International Conference on Field-Programmable Technology, (IEEE, Seoul, Korea, 2012), pp. 67–70
 3.
Y Cho, S Jeong, J Jeong, H Shim, Y Han, S Ryu, J Kim, Case study: verification framework of Samsung reconfigurable processor, 2012 13th International Workshop on Microprocessor Test and Verification, (IEEE, Austin, TX, 2012), pp. 19–23
 4.
J Lee, Y Shin, W Lee, S Yun, J Kim, Real-time ray tracing on coarse-grained reconfigurable processor, 2013 International Conference on Field-Programmable Technology, (IEEE, Kyoto, Japan, 2013), pp. 192–197
 5.
S Jin, W Seo, Y Cho, S Ryu, Low-power reconfigurable audio processor for mobile devices, 2014 IEEE International Conference on Consumer Electronics, (IEEE, Las Vegas, NV, 2014), pp. 369–370
 6.
K Kwon, S Son, J Park, J Park, S Woo, S Jung, S Ryu, Mobile GPU shader processor based on non-blocking coarse grained reconfigurable arrays architecture, 2013 International Conference on Field-Programmable Technology, (IEEE, Kyoto, Japan, 2013), pp. 198–205
 7.
NR Miniskar, PS Gode, S Kohli, D Yoo, Function inlining and loop unrolling for loop acceleration in reconfigurable processors, Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems, (ACM, New York, NY, 2012), pp. 101–110
 8.
RC Bilcu, Multiframe auto white balance. IEEE Signal Process. Lett. 18(3), 165–168 (2011)
 9.
J Huo, Y Chang, J Wang, X Wei, Robust automatic white balance algorithm using gray color points in images. IEEE Trans. Consum. Electron. 52(2), 541–546 (2006)
 10.
N Kehtarnavaz, H Oh, Y Yoo, Development and real-time implementation of auto white balancing scoring algorithm. J. Real-Time Image Proc. 8(5), 379–386 (2002)
 11.
C Weng, H Chen, C Fuh, A novel automatic white balance method for digital still cameras, IEEE International Symposium on Circuits and Systems, (IEEE, Kobe, Japan, 2005), pp. 3801–3804.
 12.
G Sapiro, Color and illuminant voting. IEEE Trans. Pattern Anal. Mach. Intell. 21(11), 1210–1215 (2002)
 13.
BK Gunturk, J Glotzbach, Y Altunbasak, RW Schafer, RM Mersereau, Demosaicking: color filter array interpolation. IEEE Signal Process. Mag. 22(1), 44–54 (2005)
 14.
D Menon, G Calvagno, Color image demosaicking: an overview. Signal Process. Image Commun. 26(8–9), 518–533 (2011)
 15.
K Hirakawa, TW Parks, Adaptive homogeneity-directed demosaicing algorithm. IEEE Trans. Image Process. 14(3), 360–369 (2005)
 16.
BK Gunturk, Y Altunbasak, RM Mersereau, Color plane interpolation using alternating projections. IEEE Trans. Image Process. 11(9), 997–1013 (2002)
 17.
M Fairchild, Color appearance models, 2nd edn. (John Wiley & Sons, New York, NY, USA, 2005)
 18.
S Paris, F Durand, A fast approximation of the bilateral filter using a signal processing approach, 9th European Conference on Computer Vision, (Springer, Graz, Austria, 2006), pp. 566–580
 19.
T Melange, M Nachtegael, EE Kerre, Fuzzy random impulse noise removal from color image sequences. IEEE Trans. Image Process. 20(4), 959–970 (2010)
 20.
V Aurich, J Weule, Nonlinear Gaussian filters performing edge preserving diffusion, 17th DAGM-Symposium Mustererkennung, (Springer-Verlag London, Bielefeld, Germany, 1995), pp. 538–545
 21.
S Paris, P Kornprobst, J Tumblin, F Durand, Bilateral filtering: theory and applications, (Now Publishers Inc, Boston, MA, USA, 2009)
 22.
O Ghita, DE Ilea, PF Whelan, Adaptive noise removal approach for restoration of digital images corrupted by multimodal noise. IET Image Process. 6(8), 1148–1160 (2012)
 23.
R Manduchi, C Tomasi, Bilateral filtering for gray and color images, Sixth International Conference on Computer Vision, (IEEE, Bombay, India, 1998), pp. 839–846
 24.
N Sengee, A Sengee, H Choi, Image contrast enhancement using bi-histogram equalization with neighborhood metrics. IEEE Trans. Consum. Electron. 56(4), 2727–2734 (2010)
 25.
T Kim, J Paik, Adaptive contrast enhancement using gain-controllable clipped histogram equalization. IEEE Trans. Consum. Electron. 54(4), 2803–2810 (2008)
 26.
J Cho, J Bae, Color transient improvement with transient detection and variable length nonlinear filtering. IEEE Trans. Consum. Electron. 54(4), 1873–1879 (2008)
 27.
Y Wang, X Chen, H Han, S Peng, Video luminance transient improvement using difference-of-Gaussian, 15th Asia-Pacific Conference on Communications, (IEEE, Shanghai, China, 2009), pp. 249–253
 28.
GGC Holst, TS Lomheim, CMOS/CCD sensors and camera systems, 2nd edn, (SPIE, Bellingham, WA, 2007)
 29.
ES Kim, SW Jang, SH Lee, TY Jung, KI Sohng, Optima piece linear segments of gamma correction for CMOS image sensor. IEICE Trans. Electron. 88-C(11), 2090–2093 (2005)
 30.
B Pham, G Pringle, Color correction for an image sequence. IEEE Comput. Graph. Appl. Mag. 15(3), 38–42 (1995)
 31.
HW Certification Team, Skype hardware certification specification for all Skype video devices version 5.0, http://www.imatest.com/wpcontent/uploads/2011/11/Skype_certdesktopapispecvideo_5.0.pdf, accessed 15 Jan 2013
 32.
JP Shen, MH Lipasti, Modern processor design fundamentals of superscalar processors, (Waveland Press Inc, Illinois, 2013)
 33.
S Furber, ARM system-on-chip architecture, 2nd edn. (Addison-Wesley, Boston, 2000)
 34.
N Seshan, High VelociTI processing [Texas Instruments VLIW DSP architecture]. IEEE Signal Process. Mag. 15(2), 86–101, 117 (1998)
 35.
F Porikli, Constant time O(1) bilateral filtering, 2008 IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, Anchorage, AK, 2008), pp. 1–8
 36.
D Levinthal, Performance analysis guide for Intel® Core™ i7 processor and Intel® Xeon™ 5500 processors, https://software.intel.com/enus/articles/processorspecificperformanceanalysispapers, accessed 15 March 2015
 37.
Technical Reference Manual, Cortex-A9 MPCore, https://developer.arm.com/products/processors/cortexa/cortexa9, accessed 15 Mar 2015
 38.
SPRU732J, TMS320C64x/C64x+ DSP CPU and instruction set, http://www.ti.com/lit/ug/spru732j/spru732j.pdf, accessed 15 March 2015
 39.
K Hoste, L Eeckhout, Cole: compiler optimization level exploration, Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization, (IEEE/ACM, New York, NY, USA, 2008), pp. 165–174
 40.
D Xu, Study of MTF measurement technique based on special object image analyzing, 2012 International Conference on Mechatronics and Automation, (IEEE, Chengdu, China, 2012), pp. 2109–2113
Acknowledgements
This research was partly supported by Samsung Electronics; the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP2016H8501161005) supervised by the IITP (Institute for Information and Communications Technology Promotion), and the Research Grant of Kwangwoon University in 2015.
Authors’ contributions
SC implemented the proposed ISP algorithm, participated in all of the experiments, and drafted the manuscript. JC and YT provided guidance on the performance requirements and the design of the experiments and gave feedback on the experimental results. SL conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Choi, S., Cho, J., Tai, Y. et al. A parallel camera image signal processor for SIMD architecture. J Image Video Proc. 2016, 29 (2016). https://doi.org/10.1186/s1364001601372
Keywords
 CMOS image sensor
 Image signal processor
 Reconfigurable processor
 Parallel processing optimization