Abstract
Identifying near-duplicate data can be applied to any type of content and has been widely used for increasing search engines' efficiency, detecting plagiarism or spam, etc. As a near-duplicate detection (NDD) method, sectional MinHash (S-MinHash) estimates the similarity between the text content in high accuracy by considering the section of every document's attributes with similarity estimation. However, due to the addition in computational complexity, it still has some performance issues such as being slow. The proposed sectional Min–Max Hash method aims to reduce the hashing time while preserving and improving the accuracy of detecting near-duplicate documents. We achieved this goal by combining S-MinHash with Min–Max Hash method. The results show that our new method reduces the hashing time and provides more speed due to using half of the random hash functions that S-MinHash needs to build up the signature matrix. Furthermore, we conducted experiments to compare our sectional Min–Max Hash with the baseline methods on the evaluated dataset and confirmed that in terms of the running time and algorithm's precision, the proposed method yields better results than the S-MinHash and other NDD techniques. Also, by assuming that we have two sections, as the best-case performance for sectional algorithms on the evaluated dataset, the error rate reduced significantly in the proposed method, and the F-score reached up to 99%.
Similar content being viewed by others
References
Fellah A (2021) All-Three: Near-optimal and domain-independent algorithms for near-duplicate detection. Array 11:100070. https://doi.org/10.1016/j.array.2021.100070
Wang Y, Zeng D, Zheng X, Wang F (2009) Propagation of online news: dynamic patterns. In: 2009 IEEE International Conference on Intelligence and Security Informatics, Richardson, Texas, USA. IEEE, pp 257–259. https://doi.org/10.1109/ISI.2009.5137321
Broder AZ, Eiron N, Fontoura M, Herscovici M, Lempel R, McPherson J, Qi R, Shekita E (2006) Indexing shared content in information retrieval systems. In: International Conference on Extending Database Technology, Munich, Germany. Springer, pp 313–330. https://doi.org/10.1007/11687238_21
Bernstein Y, Shokouhi M, Zobel J (2006) Compact features for detection of near-duplicates in distributed retrieval. In: International Symposium on String Processing and Information Retrieval. Springer, Berlin, Heidelberg, pp 110–121. https://doi.org/10.1007/11880561_10
Broder AZ (2000) Identifying and filtering near-duplicate documents. In: Annual Symposium on Combinatorial Pattern Matching. Springer, pp 1–10. https://doi.org/10.1007/3-540-45123-4_1
Hassanian-esfahani R, Kargar M-j (2018) Sectional minhash for near-duplicate detection. Experts Syst Appl 99:203–212. https://doi.org/10.1016/j.eswa.2018.01.014
Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings. Compression and Complexity of SEQUENCES (Cat. No. 97TB100171). IEEE, pp 21–29. https://doi.org/10.1109/SEQUEN.1997.666900
Cohen E (2016) Min-Hash Sketches: A Brief Survey. Encyclopedia of Algorithms: 1282–1287.
Zhang C, Lin Y, Zhu L, Yuan X, Long J, Huang F (2019) Hierarchical one permutation hashing: efficient multimedia near duplicate detection. Multimed Tools Appl 78(21):30537–30560. https://doi.org/10.1007/s11042-018-6178-z
Xiao Q, Wen L, Zhang Q (2021) Multi-resolution Odd Sketch for Mining Jaccard Similarities between Dynamic Streaming Sets. In: 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Dalian, China. IEEE, pp 198–203. https://doi.org/10.1109/CSCWD49262.2021.9437641
Li P, Owen A, Zhang C-H (2012) One permutation hashing. In: 2012 26th Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe. pp 3113–3121
Cohen E, Datar M, Fujiwara S, Gionis A, Indyk P, Motwani R, Ullman JD (2001) Finding interesting associations without support pruning. IEEE Trans Knowl Data Eng 13(1):64–78. https://doi.org/10.1109/69.908981
Cohen E, Kaplan H (2007) Bottom-k sketches: Better and more efficient estimation of aggregates. In: ACM SIGMETRICS Performance Evaluation Review. vol 1. ACM, pp 353–354. https://doi.org/10.1145/1269899.1254926
Li P, König C (2010) b-Bit minwise hashing. In: Proceedings of the 2010 19th international conference on World wide web, Raleigh, North Carolina, USA. ACM, pp 671–680. https://doi.org/10.1145/1772690.1772759
Tang J, Tian Y, Liu D (2015) Connected bit minwise hashing for large-scale linear svm. In: 2015 IEEE 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, China. IEEE, pp 995–1002. https://doi.org/10.1109/FSKD.2015.7382079
Zhang W, Ji J, Zhu J, Li J, Xu H, Zhang B (2016) BitHash: An efficient bitwise Locality Sensitive Hashing method with applications. Knowl-Based Syst 97:40–47. https://doi.org/10.1016/j.knosys.2016.01.022
Ji J, Li J, Yan S, Tian Q, Zhang B (2013) Min-max hash for jaccard similarity. In: 2013 IEEE 13th International Conference on Data Mining, Dallas, Texas, USA. IEEE, pp 301–309. https://doi.org/10.1109/ICDM.2013.119
Ertl OJapa (2017) Superminhash-A new minwise hashing algorithm for jaccard similarity estimation. arXiv preprint arXiv:1706.05698
Mitzenmacher M, Pagh R, Pham N (2014) Efficient estimation for high similarities using odd sketches. In: Proceedings of the 2014 23rd international conference on World wide web, Seoul, Korea. ACM, pp 109–118. https://doi.org/10.1145/2566486.2568017
Jia P, Wang P, Tao J, Guan X (2019) A fast sketch method for mining user similarities over fully dynamic graph streams. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China. IEEE, pp 1682–1685. https://doi.org/10.1109/ICDE.2019.00172
Wang P, Qi Y, Zhang Y, Zhai Q, Wang C, Lui JC, Guan X (2019) A memory-efficient sketch method for estimating high similarities in streaming sets. In: Proceedings of the 2019 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA. ACM, pp 25–33. https://doi.org/10.1145/3292500.3330825
Pamulaparty L, Rao CG, Rao MS (2015) XNDDF: Towards a framework for flexible near-duplicate document detection using supervised and unsupervised learning. Proc Comput Sci 48:228–235. https://doi.org/10.1016/j.procs.2015.04.175
Akram J, Mumtaz M, Luo P (2020) IBFET: Index‐based features extraction technique for scalable code clone detection at file level granularity. Softw Pract Exp 50 (1):22–46. https://doi.org/10.1002/spe.2759
Yusof N, Ismail A, Majid NAA (2019) A hybrid model for near-duplicate image detection in mapreduce environment. TEM J 8 (4): 1252–1258. https://doi.org/10.14569/IJACSA.2019.0101248
HaCohen-Kerner Y, Tayeb A (2017) Rapid detection of similar peer-reviewed scientific papers via constant number of randomized fingerprints. Inf Process Manag 53(1):70–86. https://doi.org/10.1016/j.ipm.2016.06.007
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Shayegan, MJ., Faizollahi-Samarin, M. An extended version of sectional MinHash method for near-duplicate detection. J Supercomput 78, 15638–15662 (2022). https://doi.org/10.1007/s11227-022-04447-x
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-022-04447-x