# Introduction

Our world is moving towards digitized business. This opens up numerous avenues to increase revenue through digital marketing, sales forecasting, and similar activities. A huge amount of historical data is available to analyze customer behavior and buying patterns and to make predictions for the future. However, it also comes with challenges. A substantial part of the value to be harvested from digitization depends on the successful integration of large volumes of data from different sources. Unfortunately, many existing data sources do not share a common frame of reference. For example, suppose a marketing team wants to use statistics from retail stores and e-commerce sites to find potential buyers for a product. These two systems do not refer to customers in the same way, i.e., there are no common identifiers or names across the two systems. Unless customer records are tagged uniquely, duplicate emails or messages may be sent to the same customer again and again. Customer recommendations and an effective marketing scheme cannot be built on distinct data silos.

A group of similar problems has been studied for a long time in a variety of fields under different names, such as entity resolution and de-duplication. Entity matching is the field of research dedicated to determining which records refer to the same real-world entity. Organizations often struggle with a plethora of customer data captured multiple times in different sources by various people in their own ways.

Despite having been studied for decades, entity matching remains a challenging problem in practice. In general, several factors make it difficult to solve:

Poor Data Quality: Real-world data is seldom completely structured, cleansed, and homogeneous. Data originating from manual entry may contain alternative spellings or typos, or fail to comply with the schema (e.g., mixing of first and last name).

Dependency on Human Knowledge: The same data may be represented in different formats by different users, for example with abbreviations, suffixes, or prefixes. To perform matching, a solution must interact with human experts and make use of their knowledge, and human interaction is itself a complex domain. For example, consider a customer table from which an analyst is trying to identify distinct customers. Without manual inspection and a good understanding of geographical locations, it is difficult to guess whether record 2 is a duplicate of record 1 or of record 3.

Somewhat ironically, as is often pointed out, entity matching itself suffers from being referenced by different names; some refer to exactly the same problem, while others are slight variations, generalizations, or specializations. The names are also not used entirely consistently. Deduplication, or duplicate detection, is the problem of identifying records in the same data source that refer to the same entity, and can be seen as the special case in which the two data sources being matched are identical. Given such variations in representation and the very large number of possible permutations and combinations, entity matching becomes a herculean task on large volumes of data.

Artificial intelligence and machine learning have become an essential part of multiple research fields in recent years, most notably natural language processing and computer vision, which are concerned with unstructured data. Their most prominent advantage over systematic approaches is the ability to learn features instead of relying on step-by-step calculations.

# a) Problem Definition

Researchers have already realized the potential advantage of machine learning for entity matching. In this paper, we propose a machine learning model for entity matching. Let $E$ be a data source containing entities. $E$ has the attributes $(a_1, a_2, \ldots, a_n)$, and we denote entities as $e = (e_1, e_2, \ldots, e_n) \in E$.
A data source is a set of records, and a record is a tuple with a specific schema of attributes. An attribute is defined by the intended semantics of its values. So, entities $e_i = e_j$ if and only if the attributes $a_i$ of $e_i$ are intended to carry the same information as the attributes $a_j$ of $e_j$; the specific syntactics of the attribute values are irrelevant. Attributes can also have metadata (like a name) associated with them, but this does not affect the equality between them.

The goal of entity matching is to find the largest possible binary relation $M \subseteq E \times E$ such that $x$ and $y$ refer to the same entity for all $(x, y) \in M$. In other words, we would like to find all record pairs across the data source that refer to the same entity. We define an entity to be something of unique existence. Attribute values are often assumed to be strings, but that is not always the case. The records are assumed to operate with the same taxonomic granularity. In this research, we stick to the definition of deduplication (or duplicate detection) as the problem of identifying which records in the same data source refer to the same entity.

The remainder of this paper is organized as follows. We discuss related work in Section 2. In Section 3, we formally formulate the problem and propose our methodology. Section 4 describes how our approach is used to detect similarity in a real-world data set and explains the results of our experiment. Finally, the paper is concluded in Section 5.

# II. Related Work

Entity resolution, record linkage, deduplication, and entity matching are, as mentioned earlier, frequently used for more or less the same problem. It is a technique to identify data records, in a single data source or across multiple data sources, that refer to the same real-world entity, and to link those records together. In entity matching, strings that are nearly identical, but not exactly the same, are matched without an explicit unique identifier.
Entity matching is crucial because it matches non-identical records despite data inconsistencies, without the constant need to formulate rules. By combining databases using fuzzy matching, we can refine the data and analyze the information. Comparing big data records containing nonstandard and inconsistent data from diverse sources that provide no unique identifier is a complex problem. In this section, we present an overview of previous work on entity matching.

Researchers use two major techniques:

Rule-Based: Rule-based systems perform matching based on a set of manually crafted rules. To match any two records of the same entity, various string-based comparison rules are defined. Each record is then compared with every other record under all these rules to decide whether the two are identical.

Automatic: These systems rely on machine learning algorithms to learn from data. Computers first learn from training data so that they can later make predictions on unseen input data.

A rule-based system usually relies on a set of human-crafted rules to decide matches. As the number of records increases, the number of pairwise comparisons in a rule-based system grows quadratically. With large volumes of records, rule-based data matching becomes computationally challenging and unscalable. Automatic methods, in contrast, do not rely on manually crafted rules but on machine learning algorithms. Interest in machine learning as a solution for entity matching has increased in recent years. We note that this process is machine-oriented and does not highlight any iterative human interactions or feedback loops.

First, several books provide an overview. Christen [15] is a dedicated and comprehensive source on entity matching. Anhai Doan et al. [2] and Talburt [10] introduce entity matching in the context of data quality and integration.
Quite early on, statisticians dominated the field of entity matching. Probabilistic methods were first developed by Newcombe et al. [15]. A solid theoretical framework was presented by Fellegi and Sunter [9]. Blocking, which is surveyed by Papadakis et al. [8,9], is considered an important subtask of entity matching; it is meant to tackle the quadratic complexity of potential matches. Christophides et al. [24] specifically review entity matching techniques in the context of big data. Significant research has gone into active learning approaches by Arasu et al. [3], Kasai et al. [11], and Qian et al. [12]. Interestingly, Kasai et al. [11] use a deep neural network in their active learning approach. Such human-in-the-loop factors are often crucial for entity matching in practice, as analyzed by Doan et al. [2].

Many state-of-the-art models for natural language processing are based on deep learning networks. Central to all these approaches is how text is transformed into a numerical format suitable for a neural network. This is mainly done through embeddings, which are translations from text units to a vector space, traditionally available in a lookup table. The text units are usually characters or words. An embedding lookup table may be seen as parameters of the network and can be learned end-to-end together with the rest of the network. That way the network is able to learn good, distributed character or word representations for the problem at hand. The words used in a data set are often not unique to that data set, but rather typical words from some language. Therefore, one may often get a head start by using pretrained word embeddings like word2vec, GloVe, or fastText, which have been trained on enormous general corpora.

One particularly influential recent trend is the ability to leverage huge pretrained models that have been trained unsupervised for language modeling on massive text corpora, similar to what the computer vision community has done for image recognition.
They produce contextualized word embeddings that take the surrounding words into account. These contextual embeddings, as popularized by BERT, can be used as a much more powerful variant of the classical word embeddings. However, with neural networks, the line between the initial feature extraction part and the rest is an artificial one and not necessarily indicative of how the networks actually learn and work. Such distinctions do reflect design decisions to a certain degree and help us compare approaches in that regard. Often these approaches use pre-built word embeddings for a specific set of values.

Our research focuses on entity matching based on attributes, where the number of attributes may vary from one use case to another. We also try to address the problem of multiple domains, i.e., the machine learning model must be suitable for entities from various categories such as customers, products, and vendors. In this paper, we present a machine learning model that performs attribute-based matching of entities. The type and number of attributes may vary over time, but our approach does not require re-design; merely re-training the model on the new data set suffices. The model is robust enough to handle slight variations in the ordinality and type of the attributes.

# III. Methodology

Most neural network-based methods perform entity matching by producing so-called knowledge graph embeddings, i.e., embeddings of entries that incorporate information about their relationships with other entries. The embeddings work mainly at word level or character level. Embeddings offer neural networks an initial mapping from the actual input to a suitable numeric representation. When we surveyed earlier methods, we found that researchers focus on explicit representations of entities as a single word or text. We, however, try to address two main problems:

- How to perform matching of entities containing attributes of different data types, say string, boolean, and categorical?
- Will the machine learning algorithm continue to work even if the number of attributes changes over time?

Let's say there are a few entities in a data set as shown in Table 1. It has two duplicates. Following is a generalized notation. The entities $e_1$ and $e_2$ are the same: their attribute values may vary slightly but have similar meanings. Our aim is to design an approach that combines attribute-level similarity and artificial intelligence to classify entities as unique or duplicate. We propose a two-step methodology in which the first step calculates attribute-level similarity scores and the second step is classification using supervised learning.

Feature extraction involves the use of a distance function for every pair of attributes. It transforms every pair of entities into a numerical vector. For any given pair of attributes $(a_{ik}, a_{jk})$, the distance function $d$ produces a numerical value such that

$$0 \le d(a_{ik}, a_{jk}) \le 1.$$

If the two attributes are exactly the same, then the distance is zero. If they are completely unrelated, then the distance is 1. A partial match results in a value between 0 and 1. We call it the similarity score of the attributes. A sample set of vectors for a set of three entities is as shown below. The extracted values correspond to two class labels, duplicate (D) and unique (U). If we extract the feature vectors of a data set and plot the points in a 3-dimensional space, we will see two clusters as shown below. Our approach takes every pair of entities and produces a numerical vector, which is in turn fed to a supervised machine learning algorithm for classification. The ML model learns from the training data set and makes predictions on the incoming test data.

# a) Feature Extraction using Similarity Score

The first step in ML modeling is data preprocessing, which is usually a crucial step in many data analytics tasks.
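
A minimal sketch of such a preprocessing step, assuming string-valued attributes. The helper names `normalize` and `normalize_text` are hypothetical and not taken from the paper, and the transformations shown (lowercasing, punctuation removal, whitespace normalization, tokenization) are only one plausible choice.

```python
import string
from typing import List

# Map every ASCII punctuation character to a space so that tokens
# such as "st." and "st" compare equal after cleaning.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize(value: str) -> List[str]:
    """Hypothetical helper: lowercase, drop punctuation, tokenize on whitespace."""
    cleaned = value.lower().translate(_PUNCT_TO_SPACE)
    return cleaned.split()  # split() also collapses repeated whitespace


def normalize_text(value: str) -> str:
    """Hypothetical helper: cleaned value as one space-separated string."""
    return " ".join(normalize(value))


print(normalize("181 Peachtree St."))         # ['181', 'peachtree', 'st']
print(normalize_text("  San   Francisco, "))  # san francisco
```

Replacing punctuation with a space rather than deleting it is a design choice of this sketch: it keeps hyphenated or slash-separated values as separate tokens instead of fusing them.
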
Typical transformations involve lowercasing all letters, removing excess punctuation, normalizing values, and tokenizing. There are two other major steps in our process: the second is the construction of feature vectors using similarity scores, and the last is machine learning.

For example, consider the Levenshtein distance as a similarity function. The Levenshtein distance between 'new yrk' and 'new york' is one, since at least one edit (insertion, deletion, or substitution) is needed to transform 'new yrk' into 'new york'. It is advisable to normalize the similarity scores to the range between 0 and 1 for improved accuracy of the machine learning algorithm.

# b) Classification using Supervised Learning

The matching phase aims to develop the prediction model, which takes a candidate pair as input and predicts whether the pair is matching or non-matching. Figure 2 illustrates that the model predicts an output label, Duplicate (D) or Unique (U). This is a binary classification problem. Data scientists need to decide which algorithm is most suitable for their classification task. Based on our study and experiments, we found three classification algorithms suitable for this task.

# i. K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is an algorithm that stores all available cases from the data set and classifies a new data item by a majority vote of its K nearest neighbors, as measured by a distance (metric) function. The metric functions include Euclidean, Manhattan, Minkowski, and Hamming distances. KNN can be used for both regression and classification problems; in industry, however, it is mostly used for classification.

# ii. XGBoost

XGBoost stands for Extreme Gradient Boosting. It is a scalable, distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is a leading machine learning library for regression and classification problems.

# iii. Support Vector Machines (SVM)

Support Vector Machine is a supervised learning algorithm that analyzes data and recognizes patterns. We plot the data as points in an n-dimensional space, where the value of each feature is tied to a particular coordinate, which makes it easy to classify the data. Finally, we need to tune the hyper-parameters in order to get the best model performance.

# IV. Experiments and Results

Automatic entity matching makes the life of commercial organizations easier. A company that maintains thousands of customer records cannot afford to employ many people to manually verify records and identify duplicates. Artificial intelligence-based entity matching is an efficient and cost-effective analytics tool for operational efficiency. We used open-source data sets for our experiments. While several open-source data sets are available, we picked a few commercial data sets for analysis. In this section, we describe the evaluation tasks, the data sets used, and the experimental results of our approach.

Evaluation Tasks:

1. We evaluate our approach on a real-world data set.
2. We evaluate our approach on popular benchmarks.

Our goal is to provide a real-life solution using our approach. We aim to evaluate the quality of entity matching. The empirical results are compared with real-time data to assess accuracy, and they are promising.

# a) Data Set

We conducted extensive experiments on real-world benchmark entity data sets to evaluate the performance of our approach. The following are a few open-source data sets available for evaluating entity matching algorithms. Many commercial organizations are nowadays struggling with customer de-duplication. Automatic deduplication is significant in various sectors such as banking and finance, insurance, telecom, and retail. Hence our results mainly focus on the accuracy metrics on the customer data set.

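
To make the two-step methodology of Section III concrete for the customer de-duplication setting, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a normalized Levenshtein similarity for every attribute (1.0 for identical strings, i.e. one minus the normalized distance) and a small hand-written KNN vote in place of a library classifier. All function names are hypothetical, the records are taken from the Fodors-Zagats sample in Table 7, and the "U" labels on the cross-restaurant training pairs are assumed for illustration.

```python
from collections import Counter
from math import dist


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (ch_a != ch_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized score in [0, 1]; 1.0 means identical (assumed convention)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def feature_vector(rec_a, rec_b):
    """Step 1: one similarity score per attribute pair."""
    return [similarity(x.lower(), y.lower()) for x, y in zip(rec_a, rec_b)]


def knn_predict(train, query, k=3):
    """Step 2: majority vote of the k nearest labelled vectors (Euclidean)."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Records (name, address, city, phone) from the sample shown in Table 7.
ritz_a = ("restaurant ritz-carlton atlanta", "181 Peachtree st.", "Atlanta", "404/659 -0400")
ritz_b = ("ritz-carlton restaurant", "181 Peachtree st.", "Atlanta", "404/659 -0400")
post_a = ("posterior", "545 post st.", "San Francisco", "415/776 -7825")
post_b = ("postrio", "545 post street.", "San Francisco", "415/776 -7825")
tavern_a = ("tavern on the green", "in central park at 67th st", "New York", "212/873 -3200")
tavern_b = ("tavern on the green", "central park west", "New York", "212/873 -3200")
careys = ("carey's", "1021 cobb pkwy . se", "marietta", "770-422-8042")

# Toy training set; the "U" labels for cross-restaurant pairs are assumed.
train = [
    (feature_vector(ritz_a, ritz_b), "D"),
    (feature_vector(post_a, post_b), "D"),
    (feature_vector(ritz_a, post_a), "U"),
    (feature_vector(ritz_b, post_b), "U"),
]

print(levenshtein("new yrk", "new york"))  # 1
print(knn_predict(train, feature_vector(tavern_a, tavern_b)))
print(knn_predict(train, feature_vector(tavern_a, careys)))
```

In practice, a different similarity function would likely be chosen per attribute type, and the hand-written vote would be replaced by a tuned KNN, XGBoost, or SVM model trained on a much larger labelled set.
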
# b) Popular Metrics

In this section, we first describe a set of metrics commonly used for evaluating the performance of a classification model. Then we present a quantitative analysis of the performance on popular benchmarks.

Accuracy and Error Rate: These are the primary metrics for evaluating the quality of a classification model. Let TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. The classification Accuracy and Error Rate are defined in Equation 1:

$$\text{Accuracy} = \frac{TP + TN}{N}, \qquad \text{Error Rate} = \frac{FP + FN}{N} \tag{1}$$

where $N$ is the total number of samples. Obviously, we have Error Rate = 1 - Accuracy.

Precision, Recall, and F1 Score: These are also primary metrics and are more often used than accuracy or error rate for imbalanced test sets. Precision and recall for binary classification, together with the F1 score, which is the harmonic mean of precision and recall, are defined in Equation 2:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2}$$

The F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0.

# c) Empirical Results

We aim to use our entity matching system in real-world applications such as retail and e-commerce. We analyzed the presence of duplicate customer data, and the results showed more than 80% accuracy on real-world data sets. Following is a set of predictions made by our system on the Fodors-Zagats dataset. From the above table, we observe that customers and vendors can easily get their ambiguities resolved using an automatic entity matching system. AI-based entity matching is an alternative to traditional manual or other text analysis-based tools, and it is a cost-effective solution for decision-makers.

# V. Conclusion and Future Work

The proposed method accomplished superior performance in terms of time and cost. The overall benefits of AI-based entity matching include:

Sorting Data at Scale: Manually screening thousands of customer records or product details is complex and time-consuming. AI-based entity matching helps businesses process large amounts of data in an efficient and cost-effective way.

Real-Time Analysis: Automatic entity matching can help organizations quickly identify duplicates in real time and act swiftly before duplicate marketing or promotional offers are sent out.

Though many deep learning models are being developed nowadays for entity matching, we propose a supervised learning model for a few major reasons.

Explainability and Ease of Debugging: For many applications, it is crucial to trust the data source, and understanding why something does not work is key. Unfortunately, deep learning models are notoriously hard to interpret. As the steps in the entity matching process increasingly coalesce into one large neural network, we get fewer checkpoints along the way that can easily be inspected. We can no longer see the output of each step in the same way. Therefore, figuring out why two records were matched or not matched is usually nontrivial when inspecting deep learning models. A few techniques are already in use, such as looking at alignment scores, but we are still far from a comprehensive way of debugging neural networks for entity matching.

Our model addresses the challenges of explainability, running time in interactive settings, and the large need for training examples. The explainability of our supervised learning algorithm helps researchers improve accuracy through inspection and comparison of algorithms, and meet real-world demands. We also see many opportunities in developing more open datasets, standardized benchmarks, and publicly available pretrained models for entity matching.

![Figure 1: Feature Vectors Plotted in 3-D Space](image-2.png "Figure 1")

For m entities having n attributes, after feature extraction, we will get m x n values under the two labels. Now, the entity matching problem is reduced to a binary classification problem, where the objective is to predict a pair of entities as unique or duplicate. Feature extraction involves attribute-level comparison using fuzzy matching algorithms. The produced output is a labelled data set that can be used to train a model using a supervised learning algorithm. A well-trained model will make predictions on incoming data points. Points that lie around the boundary or away from the cluster centroid might require manual stewarding. The following diagram shows the architecture of our machine learning-based entity matching system.

![Figure 2: Architecture of Entity Matching System](image-3.png "Figure 2")

![Equation 2](image-4.png "(2)")

For multi-class classification problems, we can always calculate precision and recall for each class label and analyze the individual performance on class labels, or average the values to get the overall precision and recall. In our case, the averages for the two labels Duplicate (D) and Unique (U) were calculated, and the following diagram is the pictorial representation of the metrics.

![Figure 3: Quantitative Metrics Analysis](image-5.png "Figure 3")

From the above results, we observe that XGBoost has the highest F1 score and is best suited for the entity matching problem. The following table shows the final metrics of experiments conducted using various similarity scores and classification algorithms over the Fodors-Zagats dataset.

Table 1

| No. | Name | Address | Email |
|---|---|---|---|
| 1 | Alexander Great | 2/13, Philip France Street, Paris, | alex.gr@gmail.com |
| 2 | Alexander G | 2/13, Philip Street, Paris | n/a |
| 3 | Alexander Graham | 10, Middle Street, New York | alex.gr@yahoo.com |

Table 2

| Entity | Attribute1 | Attribute2 | Attribute3 | Label |
|---|---|---|---|---|
| e1 | a11 | a12 | a13 | Duplicate |
| e2 | a21 | a22 | a23 | (e1 = e2) |
| e3 | a31 | a32 | a33 | Unique |

Table 3

| Entity Pair | Score1 | Score2 | Score3 | Label |
|---|---|---|---|---|
| e1, e2 | $d(a_{11}, a_{21}) = 0.8$ | $d(a_{12}, a_{22}) = 0.6$ | $d(a_{13}, a_{23}) = 1$ | D |
| e2, e3 | $d(a_{21}, a_{31}) = 0.5$ | $d(a_{22}, a_{32}) = 0.6$ | $d(a_{23}, a_{33}) = 0$ | U |
| e1, e3 | $d(a_{11}, a_{31}) = 0.6$ | $d(a_{12}, a_{32}) = 0.4$ | $d(a_{13}, a_{33}) = 1$ | U |

Table 4: Popular algorithms

| No. | Data Type | Similarity Function |
|---|---|---|
| 1 | | Exact Match |
| 2 | | Levenshtein Distance |
| 3 | Single Word String | Jaro Distance |

The last step is machine learning.
One might also view the second step as feature extraction, since records are transformed to a feature space. The success of this entity matching system depends upon careful selection of the right algorithms. Attributes are often assumed to be strings, but that is not always the case. Attributes of an entity may be of any data type, such as string, numeric, categorical, or boolean. No single function can calculate the similarity score for every kind of attribute. It is useful to compare the various functions available for similarity scoring and pick the right one. To this end, we present a high-level overview of a few popular algorithms.

Table 5

| No. | Dataset | Description | Training Size | Testing Size | No. of Attributes |
|---|---|---|---|---|---|
| 1 | Fodors-Zagats | Customer records with name, address, city, phone, type, and category code. | 757 | 189 | 6 |
| 2 | iTunes-Amazon | Records of songs with song name, artist name, album name, genre, etc. | 430 | 109 | 8 |
| 3 | DBLP-ACM | Publication dataset with paper title, author, venue, etc. | 9890 | 2473 | 4 |
| 4 | DBLP-Scholar | Publication dataset with title, authors, venue, and year. | 22965 | 5742 | 4 |
| 5 | Amazon-Google | Software product dataset with attributes product title, manufacturer, and price. | 9167 | 2293 | 3 |
| 6 | Walmart-Amazon | Electronic product dataset with attributes product name, category, brand, model number, etc. | 8193 | 2049 | 5 |
| 7 | Abt-Buy | Product dataset with attributes product name, price, and description. | 7659 | 1916 | 3 |

Table 6

| CLASSIFICATION ALGORITHM | XGBoost | KNN | SVM |
|---|---|---|---|

Table 7

| No. | Name | Address | City | Phone | Label |
|---|---|---|---|---|---|
| 1 | restaurant ritz-carlton atlanta | 181 Peachtree st. | Atlanta | 404/659 -0400 | |
| 2 | ritz-carlton restaurant | 181 Peachtree st. | Atlanta | 404/659 -0400 | D |
| 3 | posterior | 545 post st. | San Francisco | 415/776 -7825 | |
| 4 | postrio | 545 post street. | San Francisco | 415/776 -7825 | D |
| 5 | tavern on the green | in central park at 67th st | New York | 212/873 -3200 | |
| 6 | tavern on the green | central park west | New York | 212/873 -3200 | D |
| 7 | carey's | 1021 cobb pkwy . se | marietta | 770-422-8042 | U |

© 2023 Global Journals

# References

* A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based Framework for Record Matching. IEEE 24th International Conference on Data Engineering, 2008.
* Anhai Doan, Adel Ardalan, Jeffrey Ballard, Sanjib Das, Yash Govind, Pradap Konda, Han Li, Sidharth Mudgal, Erik Paulson, Paul Suganthan G. C., and Haojun Zhang. Human-in-the-loop Challenges for Entity Matching: A Midterm Report. Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics (HILDA'17), New York, NY, USA, 2017.
* Arvind Arasu, Michaela Götz, and Raghav Kaushik. On Active Learning of Record Matching Packages. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10), New York, NY, USA, ACM, 2010.
* A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng. 19, Jan. 2007.
* Cheng Fu, Xianpei Han, Le Sun, Bo Chen, Wei Zhang, Suhui Wu, and Hao Kong. End-to-end Multi-perspective Matching for Entity Resolution. Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI'19), Macao, China, AAAI Press, 2019.
* Ekaterini Ioannou, Nataliya Rassadko, and Yannis Velegrakis. On Generating Benchmark Data for Entity Matching. J. Data Semant. 2, March 2013.
* Felix Naumann and Melanie Herschel. An Introduction to Duplicate Detection. Morgan and Claypool Publishers, 2010.
* George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. A Survey of Blocking and Filtering Techniques for Entity Resolution. arXiv:cs.DB/1905.06167, May 2019.
* George Papadakis, Jonathan Svirsky, Avigdor Gal, and Themis Palpanas. Comparative Analysis of Approximate Blocking Techniques for Entity Resolution. Proceedings of the VLDB Endowment 9, May 2016.
* Hanna Köpcke, Andreas Thor, and Erhard Rahm. Evaluation of Entity Resolution Approaches on Real-world Match Problems. Proceedings of the VLDB Endowment 3, Sept. 2010.
* Ivan P. Fellegi and Alan B. Sunter. A Theory for Record Linkage. J. Am. Stat. Assoc. 64, Dec. 1969.
* John R. Talburt. Entity Resolution and Information Quality. Elsevier, 2011.
* Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. Low-resource Deep Entity Resolution with Transfer and Active Learning. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, Association for Computational Linguistics, 2019.
* Kun Qian, Lucian Popa, and Prithviraj Sen. Active Learning for Large-Scale Entity Resolution. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM '17), 2017.
* Lise Getoor and Ashwin Machanavajjhala. Entity Resolution: Theory, Practice & Open Challenges. Proceedings of the VLDB Endowment 5, Aug. 2012.
* Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. Distributed Representations of Tuples for Entity Resolution. Proceedings of the VLDB Endowment 11, 11, July 2018.
* H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic Linkage of Vital Records. Science 130, Oct. 1959.
* Nihel Kooli, Robin Allesiardo, and Erwan Pigneul. Deep Learning Based Approach for Entity Resolution in Databases. In Intelligent Information and Database Systems, Springer International Publishing, 2018.
* Peter Christen. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer Science & Business Media, 2012.
* P. Christen. A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication. IEEE Trans. Knowl. Data Eng. 24, Sept. 2012.
* Ram Deepak Gottapu, Cihan Dagli, and Bharami Ali. Entity Resolution Using Convolutional Neural Network. Procedia Comput. Sci. 95, Jan. 2016.
* Thomas N. Herzog, Fritz J. Scheuren, and William E. Winkler. Data Quality and Record Linkage Techniques. Springer Science & Business Media, 2007.
* Ursin Brunner and Kurt Stockinger. Entity Matching with Transformer Architectures - A Step Forward in Data Integration. International Conference on Extending Database Technology, Copenhagen, 30 March-2 April 2020.
* Vassilis Christophides, Vasilis Efthymiou, Themis Palpanas, George Papadakis, and Kostas Stefanidis. End-to-End Entity Resolution for Big Data: A Survey. arXiv:cs, May 2019.
* Yuliang Li, Jinfeng Li, Yoshihiko Suhara, Anhai Doan, and Wang-Chiew Tan. Deep Entity Matching with Pre-Trained Language Models. arXiv:cs.DB/2004.00584, April 2020.