- Research
- Open Access
An improved algorithm based on Bloom filter and its application in bar code recognition and processing
- Mai Jiang^{1}Email authorView ORCID ID profile,
- Chunsheng Zhao^{1},
- Zaifeng Mo^{1} and
- Jing Wen^{1}
https://doi.org/10.1186/s13640-018-0375-6
© The Author(s). 2018
- Received: 26 July 2018
- Accepted: 8 November 2018
- Published: 6 December 2018
Abstract
In many cases, databases are incompetent to meet the requirement of the quick query identification and processing of bar codes, such as the automatic sorting system of giant logistics warehouse. Bloom filter can be faster than databases, but its high false positive rate may seriously affect the efficiency of work. Although increasing the width of bit vector and the number of hash functions can reduce the false positive rate, the effect will be not significant after a certain threshold value, and this approach will increase the cost of processing time. So, it could not be increased indefinitely. This paper presents an improved algorithm based on Bloom filter and its application in bar code recognition and processing. The bit vector of Bloom filter is divided into two parts. Every element a_{i} could be mapped to a part of the bit vector by some hash functions. For each element to amplify the difference by g (), which makes g (a_{i}) = a^{*}_{i}, the a^{*}_{i} is mapped to another part of the bit vector by some hash functions too. This algorithm can significantly reduce the false positive rate of the Bloom filter, but does not increase much time and space costs.
Keywords
- Bloom filter
- Hash function
- False positive rate
1 Introduction
In many cases, the bar codes need to be quickly identified and processed in a timely manner. The traditional way is to scan the bar codes and then read the database to identify the bar codes; but on some occasions, such modern logistics warehouse of giant automatic sorting system using databases often cannot satisfy the requirement of bar codes recognition speed. Especially nowadays, with the increasing popularity of online shopping, many logistics systems are processing more and more data, which puts a lot of pressure on the database. The rapid identification of bar codes using Bloom filter is a good option. However, Bloom filter has a defect that there is a false alarm rate, which is a bad situation for the bar code’s recognition system that needs to be quickly identified and processed. Even sometimes false positives can lead to a lot of later processing time costs. Although the false positive rate could be reduced by increasing the length of the bit vector of the Bloom filter and adding the number of hash functions, the cost of time and space will also be increased. However, in systems that require quick recognition, the increasing of time and space is often restricted. In addition, the algorithm implementation of various classical hash functions of Bloom filter can easily result in the ignoring of the slight difference of similar inputs, which leads to the improvement of false positive rate. The bar code itself is often very different. In this paper, an improved Bloom filter algorithm is proposed to reduce the false positive rate and ensure the speed of bar code recognition. The algorithm proposed in this paper is to reduce the false positive rate of Bloom filter.
This paper is in the same direction with the author’s another paper [1]. However, this paper differs from the reference [1] in the following aspects and improvements. Firstly, this paper analyzes the reasons why the classical hash functions in practical applications tend to ignore the differences of similar elements, which is why similar elements are more likely to cause conflicts. The arguments of previous articles have been based on non-existent perfect hashing functions. Secondly, a new idea is proposed in this paper, that is, the hashing process is carried out after the element has been transformed to solve the problem of high conflict rate of Bloom filter caused by approximate input elements. This transformation is the differential amplification transformation. Finally, based on the proposed algorithm, this paper presents a detailed scheme for bar code recognition and processing.
We briefly introduce the results and discussion in Section 2. In this section, the research results are briefly introduced. Bloom filter and its current common usage are discussed. We describe Bloom filters in detail, and we give a hopefully precise picture of space/computing time/error rate trade-offs. An improved algorithm is introduced. In section 3, an application scheme of the algorithm in bar code recognition is introduced. The last section is the conclusion.
2 Results and discussion
An improved Bloom filter is proposed in this paper. This algorithm can effectively reduce the false positive rate of Bloom filter, but it does not increase much space and time cost. Moreover, the algorithm is relatively easy to implement. This paper applies the algorithm to bar code recognition and processing and designs a specific scheme. According to the statistical effect of the test, the algorithm reduces the false positive rate and achieves better test results. At the same time, the bar code’s automatic identification scheme designed in this paper has a favorable processing speed.
2.1 Bloom filters
Bloom filter [2] was proposed by Burton Howard Bloom in 1970. It is a compact data structures for probabilistic representation of a set in order to support membership queries (i.e., queries that ask: “Is element X in set Y?”). This compact representation is the payoff for allowing a small rate of false positives in membership queries. That is, queries might incorrectly recognize an element as a member of the set.
If it is needed to determine whether an element is in a collection, the traditional solution is to save all the elements and then compare them. The data structures such as linked lists, trees, and so on, are the same idea. But with the increasing of the elements in the collection, the storage space that they need becomes larger and larger, and the retrieval speed becomes slower and slower (O(n),O(logn)).
Bloom filter has the advantage that space efficiency and query time efficiency are much better than ordinary algorithms. A hash table can be used to determine whether an element is in a collection or not, and the retrieval is very efficient. Nevertheless, Bloom filter does the same thing with only one fourth or one eighth or even lower of the space complexity of the hash table approach. Bloom filter can insert elements, but non-existing elements can be deleted. Bloom filter is impossible to have a false negative, as long as the elements exist in Bloom filter. However, Bloom filter has one major shortcoming that it can produce false positive, and the more similar elements there are, the greater the false positive rate will be.
2.2 Usage
- A.
Blacklist
- B.
Duplicate URL detection for web crawler
- C.
File or record lookup
- D.
CDN (content delivery network)
- E.
Others
Bloom filter also has some other common uses, such as “Web Cache Sharing,” [4] “Query Filtering and Routing,” [5–8] “Compact Representation of a Differential File,” [9] “Free Text Searching,” [10] “OceanStore”[8, 11]. In summary, Bloom filter is very versatile. Over time, more new applications for Bloom filter will be developed. With the arrival of the big data era, Bloom filter will surely exert more value.
2.3 Details of Bloom filters
2.3.1 Constructing Bloom filters
There’s a set of n elements. Bloom filter uses the bit vector V with length m to describe the membership information of A. Therefore, k hash functions, h_{1}, h_{2}, …, h_{k} with ∀a_{i} ∈ A, h_{i}(a_{i}) ∈ {1. . m1}, are used as described below:
The following program builds a Bloom filter corresponding to set A. The bit vector V of the Bloom filter has m bits, using k hash functions h_{1}, h_{2}, …, h_{k}:
So, if a_{i} is a member of set A, all bits corresponding to the hash value of a_{i} are set to 1 in the resulting Bloom filter V. The following program tests whether an element belongs to set A:
When new elements are added to a collection, their corresponding locations are evaluated by the hash function, and the bits are set in the filter:
If two Bloom filters are to be merged, simple bitwise OR manipulation is performed between two-bit vectors:
2.3.2 Bloom filters—the math
It is assumed that all hash functions are perfect, and they evenly distribute the elements of set A throughout the space {1..M}. In practice, good results are obtained using MD5 and other hash functions [12].
In Eq. (2), P_{err} is minimized for \( k=\frac{m}{n}1\mathrm{n}\kern0.5em 2 \) hash functions.
In Fig. 1, the Bloom filter is 32 bits per item (m/n = 32). At this point, 22 hash functions are used to minimize the false positive rate. However, adding hash functions does not significantly reduce the error rate when more than 10 hash functions have been used.
In Fig. 2, different rows represent different numbers of hash keys. It is worth noting that the error rate of using 32 keys is not significantly an advantage over using only 8 keys.
Equation (3) can be derived from Eq. (2). From Eq. (3), the ratio of the size of the bit vector to the number of elements is obtained. Multiply both sides of Eq. (3) by n, then the size m of the bit vector V can be obtained. In other words, as long as the number of elements, false positive rate, and hash function is determined, the size of the bit vector V can be determined accordingly.
To summarize, Bloom filter is a compact data structure for querying the existence of elements. But it has a false positive rate. To design Bloom filter, we need to weigh the error rate (conflict), the number of hash functions (driver computation overhead), and the size of bit vectors. Formula (2) is the main formula of Bloom filter, which is the analysis basis of Bloom filter.
2.3.3 A modified algorithm
Similar elements bring more false positives
These are all classic hash functions, and they are all based on shifted hash functions. In the process of shift, the low and high bits are discarded. This must result in a loss of precision. For similar elements, the difference between them is small [15]. The difference bits between similar elements may be discarded by the shift operation. This leads to differences in similar elements being ignored [16]. This is also a source of false positives in the actual use of the Bloom filter. The more numbers of similar elements input, the more likely it is to produce false positives. This is determined by the function. Therefore, Bloom filter with differential amplification and rehash is proposed [17].
Constructing new Bloom filters
There is a set A = {a_{1}, a_{2}, …, a_{n}} of n elements. If a_{i} is an element of A, mapping a_{i} to a part of the filter by some hash functions, each element amplifies the difference by function g (), which makes g (a_{i}) = a^{*}_{i}. Bloom filters describe membership information of A using a bit vector V of length m. For this, k hash functions, h_{1}, h_{2}, …, h_{k} with ∀a_{i} ∈ A, h_{i}(a_{i}) ∈ {1. . m1}, and d hash functions, f_{1}, f_{2}, …, f_{d} with f_{j}(a^{∗}) ∈ {m1. . m}_{,} are used as described below:
New Bloom filter—the math
3 Used in bar code identification and processing
3.1 Bar code processing scheme
Nowadays, bar codes are more and more widely used. In many cases, the speed of bar code processing is increasingly required, such as the sorting system of large logistics companies [18]. This section describes a bar code processing scheme based on the improved Bloom filter introduced in this article. This scheme can not only improve the processing speed of bar codes, but also reduce the false positive rate as much as possible.
In this scheme, we (1) assume the image is in the ideal state of identification and verification, (2) cut out the bar code area from a picture, and (3) properly handle the detected bar codes so that it is as ideal as possible. The ideal state is that the bar code generated by code generator has no defilement of the image, both sides of the bar code are horizontal and vertical, and the bar code is positive from left to right.
In this scenario, the bar code area needs to be cropped from the image that contains the bar code to identify the bar code from the area. The obtained bar code will be the input element of the Bloom filter. Of course, the Bloom filter adopted here is the improved Bloom filter introduced in this paper. This solution takes advantage of the high-speed feature of the Bloom filter to automatically recognize the bar code and quickly retrieve it for further processing.
After the bar code area is cropped, the bar code image needs to be corrected, preprocessed, and scanned to obtain the width of the bar and space. The bar code information is then decoded according to the width of the bar and space. The obtained bar code information can be entered into the improved Bloom filter for retrieval, and then further processing can be conducted according to the retrieval results.
A bar code is made up of several numbers, letters, or symbols, and the same company’s bar codes are often only slightly different [19]. These bar codes, as input elements, are equivalent to a large number of similar keys for Bloom filter. When the hash functions of Bloom filter have some similar input elements, they may ignore the small differences, resulting in the hash function producing the same output on the similar elements and generating false positives [20]. Because a company’s bar codes are easy to focus on one section, this will result in more false positives. However, each bar code is unique, and the difference between the two bar codes is certain. Therefore, the characteristic part of the bar code can be extracted for mathematical transformation to indicate the difference between similar elements magnified. Thus, the difference between the keys in the hash function of Bloom filter can be enlarged and the false positive rate can be reduced [21].
Bar code mapping and differential amplification transformation
a | Value of a | Hash value(h_{i} : X → {1. . m1}) |
---|---|---|
a _{ 1} | x1x2x3x4x5x6x7x8x9 | h _{ 1} (a _{ 1} ), h _{ 2} (a _{ 1} ), …,h _{ j} (a _{ 1} ),... ,h _{ k} (a _{ 1} ) |
a _{2} | y1y2y3y4y5y6y7y8y9 | h _{ 1} (a _{ 2} ), h _{ 2} (a _{ 2} ), …,h _{ j} (a _{ 2} ),... ,h _{ k} (a _{ 2} ) |
a _{3} | z1z2z3z4z5z6z7z8z9 | h _{ 1} (a _{ 3} ), h _{ 2} (a _{ 3} ), …,h _{ j} (a _{ 3} ),... ,h _{ k} (a _{ 3} ) |
a _{4} | u1u2u3u4u5u6u7u8u9 | h _{ 1} (a _{ 4} ), h _{ 2} (a _{ 4} ), …,h _{ j} (a _{ 4} ),... ,h _{ k} (a _{ 4} ) |
a _{5} | v1v2v3v4v5v6v7v8v9 | h _{ 1} (a _{ 5} ), h _{ 2} (a _{ 5} ), …,h _{ j} (a _{ 5} ),... ,h _{ k} (a _{ 5} ) |
… | … | … |
Mapping after transformation
a^{*}_{i} = g (a_{i}) | Value of a^{*} | Hash value(f_{j} : X^{∗} → {m1. . m}) |
---|---|---|
a ^{ *} _{ 1} | X1X2X3X4X5X6X7X8X9 | f _{ 1} ( a ^{ *} _{ 1} ), f _{ 2} ( a ^{ *} _{ 1} ), …,f _{ j} ( a ^{ *} _{ 1} ),... ,f _{ d} ( a ^{ *} _{ 1} ) |
a ^{ *} _{ 2} | Y1Y2Y3Y4Y5Y6Y7Y8Y9 | f _{ 1} ( a ^{ *} _{ 2} ), f _{ 2} ( a ^{ *} _{ 2} ), …,f _{ j} ( a ^{ *} _{ 2} ),... ,f _{ d} ( a ^{ *} _{ 2} ) |
a ^{ *} _{ 3} | Z1Z2Z3Z4Z5Z6Z7Z8Z9 | f _{ 1} ( a ^{ *} _{ 3} ), f _{ 2} ( a ^{ *} _{ 3} ), …,f _{ j} ( a ^{ *} _{ 3} ),... ,f _{ d} ( a ^{ *} _{ 3} ) |
a ^{ *} _{ 4} | U1U2U3U4U5U6U7U8U9 | f _{ 1} ( a ^{ *} _{ 4} ), f _{ 2} ( a ^{ *} _{ 4} ), …,f _{ j} ( a ^{ *} _{ 4} ),... ,f _{ d} ( a ^{ *} _{ 4} ) |
a ^{ *} _{ 5} | V1V2V3V4V5V6V7V8V9 | f _{ 1} ( a ^{ *} _{ 5} ), f _{ 2} ( a ^{ *} _{ 5} ), …,f _{ j} ( a ^{ *} _{ 5} ),... ,f _{ d} ( a ^{ *} _{ 5} ) |
... | ... | ... |
3.2 Test results
A lot of testing has been carried out. According to the statistical effect of the test, the bar code identification error rate is lower when using the improved Bloom filter in the scheme, while the error rate is higher when using the classic Bloom filter in the scheme. If there are more similar bar codes in the test sample, the false positive rate will be greatly reduced. At the same time, the test results show that this scheme is faster than the traditional scheme using databases.
4 Conclusion
The Bloom filter produces hash values by entering keys into hash functions. In the theoretical analysis of Bloom filter, it is usually assumed that the hash functions are perfect with uniform distribution. In reality, however, the perfect hash function never exists. Sometimes, due to the particularity of the data, some hash functions have different uniform distribution of hash values due to input sample. This makes it difficult to evaluate the distribution characteristics of the hash function. So, the evaluation of the uniform distribution characteristics of the hash function is generally carried out by statistical methods. Regardless of the hash function is based on the addition, multiplication, or based on the displacement, there will be more or less hash value clustering, which leads to conflicts, especially when the input keys have more approximation. The classical Bloom filter and most of its variants are all directly hashing the elements, which cannot avoid the higher false positive rate when there are more approximations.
Therefore, an improved Bloom filter is proposed in this paper. The bit vector of Bloom filter is divided into two parts. For each element to amplify the difference by g(), which makes g(a_{i}) = a^{*}_{i}, the a^{*}_{i} is mapped to another part of the bit vector by some hash functions. When an element is retrieved in both parts of the bit vector, it is shown that the element exists in the Bloom filter.
Because this algorithm transforms elements through the function g(), the differences between elements are magnified. This reduces the collision rate of the Bloom filter. Especially when the values of input elements are similar, the effect of reducing collision is obvious. This algorithm reduces the collision rate of the Bloom filter, but it does not add much space and time cost. Algorithm implementation is also easy. The results of the statistics are also better. The algorithm is also applied to bar code recognition and processing, and a specific scheme is designed. Because the bar codes of each company are concentrated, the input elements of Bloom filter are similar. According to the results of testing, this algorithm can effectively reduce the conflict rate of bar codes in Bloom filter. At the same time, the bar code automatic identification scheme designed in this paper has an advantage of high processing speed.
5 Methods
The purpose of this study is to find a method to reduce the false positive rate of Bloom filter, especially when the input contains a large number of approximate elements. The same company’s bar codes are extractly the data with a small difference. This paper uses this improved Bloom filter to process the bar codes and designs a complete scheme.
An improved Bloom filter is proposed in this paper. We divide the bit vector of Bloom filter into two parts. Consider a set A = {a_{1}, a_{2}, …, a_{n}} of n elements. If a_{i} is an element of A, mapping a_{i} to a part of the filter by some hash functions, each element to amplify the difference of g(), which makes g(a_{i}) = a^{*}_{i}. Bloom filters describe membership information of A using a bit vector V of length m. Every element a_{i} could be mapped to a part of the bit vector by some hash functions. The a^{*}_{i} is mapped to another part of the bit vector by some hash functions too. When querying an element, it means that the element belongs to the set as long as it has a corresponding value for both parts of its bit vector.
In many cases, due to the requirement of processing speed, Bloom filter has to be used to deal with bar codes. However, the bar codes of the same company often differ slightly, which easily results in the increasing of the false positive rate of the Bloom filter. A complete bar code processing scheme is designed in this paper. This scheme adopts the improved Bloom filter proposed in this paper to process bar codes effectively. The improved Bloom filter proposed in this paper is used in the scheme. The scheme needs to identify the bar code and get the corresponding data. The data is then inputted as an element into the Bloom filter for looking up. If found, the bar code is the element that exists. For extremely individual elements, if something goes wrong, they could also be queried from the original database.
Declarations
Acknowledgements
This article was also assisted by Mr. Jun Ye and others, and thanks together. In particular, we would like to thank the reviewers for their valuable comments.
Funding
This work was partially supported by the Sichuan University of Science and Engineering research fund: 2014KY01, Enterprise Informatization and Internet of Things measurement and Control Technology Key Laboratory of the University of Sichuan: 2014WYY04, Support Plan of Sichuan Science and Technology Department: 15ZC0195, and Sichuan Education Department: 17SB0345.
Availability of data and materials
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Authors’ contributions
MJ is responsible for the algorithm and scheme design and most of the other work. CSZ participated in the design code. ZFM participated in the scheme design. JW is involved in the implementation of the scheme. All authors read and approved the final manuscript
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
- M. Jiang, C. Zhao, G. Xiang, in 6rd International Congress on Image and Signal Processing (CISP 2013). A modified algorithm based on the bloom filter (2013), pp. 1087–1091Google Scholar
- B. Bloom, Space, “time trade-offs in hash coding with allowable errors,”. Commun. ACM 13(7), 422–426 (1970)Google Scholar
- W.H.A. Yuen, A hybrid Bloom filter location update algorithm for wireless cellular system. IEEE Intl. Conf. on Communications vol. ICC(3), 1281–1286 (1997)Google Scholar
- L. Fan, P. Cao, J. Almeida, A. Broder, Summary cache: a scalable wide-area web cache sharing protocol (in Proceedings of ACM SIGCOMM’98, Vancouver, 1998)Google Scholar
- S.D. Gribble, E.A. Brewer, J.M. Hellerstein, D. Culler, in Proceedings of the Fourth Symposium on Operating Systems Design and Implementation. Scalable, “Distributed Data Structures for Internet Service Construction” (OSDI 2000, San Diego, 2000)Google Scholar
- S.D. Gribble, M. Welsh, R.V. Behren, E.A. Brewer, D. Culler, N. Borisov, S. Czerwinski, R. Gummadi, J. Hill, A.D. Joseph, R.H. Katz, Z. Mao, S. Ross, B. Zhao, The Ninja architecture for robust internet-scale systems and services. Comput. Netw. 35(4), 473–497 (2001)View ArticleGoogle Scholar
- T.D. Hodes, S.E. Czerwinski, B.Y. Zhao, A.D. Joseph, R.H. Katz, An architecture for secure wide-area service discovery. Wirel. Netw 8(2–3), 213–230 (2002)View ArticleGoogle Scholar
- J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, B. Zhao, in Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems. OceanStore: “An Architecture for Global-Scale Persistent Storage” (ASPLOS 2000, Cambridge, 2000)Google Scholar
- M. Mitzenmacher, in Twentieth ACM Symposium on Principles of Distributed Computing. Compressed bloom filters (PODC 2001, Newport, Rhode Island, 2001)Google Scholar
- J.K. Mullin, A second look at Bloom filters. Commun. ACM 26(8), 570–571 (1983)Google Scholar
- S. Rhea, W. Weimer, “Data location in the oceanstore”, in unpublished, UC Berkeley. Journal of Nutrition Education & Behavior 44(4), S25 (1999)Google Scholar
- M.V. Ramakrishna, Practical performance of Bloom filters and parallel free-text searching. Commun. ACM 32(10), 1237–1239 (1989)Google Scholar
- K.W. Choi, D.T. Wiriaatmadja, E. Hossain, Discovering mobile applications in cellular device-to-device communications: Hash function and bloom filter-based approach. IEEE Trans. Mob. Comput. 15(2), 336–349 (2016)View ArticleGoogle Scholar
- J. Qian, Q. Zhu, H. Chen, Integer-granularity locality-sensitive bloom filter. IEEE Trans. Comput. 20(11), 2125–2128 (2016)Google Scholar
- H. Byun, J. Lee, H. Lim. "Ternary Bloom filter replacing counting Bloom filter", IEEE International Conference on Consumer Electronics-asia, 2016, ICCE-Asia: 1-4. https://doi.org/10.1109/ICCE-Asia.2016.7804774
- J. HM, H. Lim, New approach for efficient IP Address lookup using a Bloom filter in trie-based algorithms. IEEE omputer Society 65(5), 1558–1565 (2016)MathSciNetMATHGoogle Scholar
- D. Pellow, D. Filippova, C. Kingsford, Improving Bloom filter performance on sequence data using k-mer Bloom filters. J. Comput. Biol. 24(6), 547–557 (2016)View ArticleGoogle Scholar
- P. Jiang, Y. Ji, X. Wang, J. Zhu, Y. Cheng, Design of a multiple Bloom filter for distributed navigation routing. IEEE Transactions on Systems Man & Cybernetics 44(2), 254–260 (2017)View ArticleGoogle Scholar
- T. Pavlidis, J. Swartz, Y.P. Wang, Fundamentals of Bar code information theory. Computer 23(4), 74–86 (2002)View ArticleGoogle Scholar
- G. Moualla, P.A. Frangoudis, Y. Hadjadj-Aoul, S. Ait-Chellouche, A bloom-filter-based socially aware scheme for content replication in mobile ad hoc networks. Consumer Communications & Networking Conference, 359–365 (2016) https://doi.org/10.1109/CCNC.2016.7444807
- H. Ko, G. Lee, S. Pack, K. Kweon, Timer-based bloom filter aggregation for reducing signaling overhead in distributed mobility management. IEEE Trans. Mob. Comput. 15(2), 516–529 (2016)View ArticleGoogle Scholar