Data Curation

Scraped and integrated data sets from multiple sources often contain missing and duplicate data, which must be pre-processed so that they do not mislead the learning process and produce an undesirable set of recommendations. Our pre-processing stage transforms the crawled web URLs into a form suitable for delivering high-quality data for recommending domain-specific scientific papers. More specifically, we performed several pre-processing tasks, including data integration, named-entity recognition, and handling missing and duplicate values (Shobha and Nickolas 2020).

The web-crawled data (a single day's data) from different scientific web archives, such as Google Scholar, Semantic Scholar, Scopus, Web of Science, and Microsoft Academic, includes the URLs of 4,681 papers. These URLs link to conference papers, journal papers, posters, books, and review articles. The URLs are further pre-processed to exclude books, reviews, conferences, and posters. Only 2,521 URLs were re-scraped to retrieve the entire document. The other pre-processing steps that were carried out are explained in notes [1] to [4].

The scraped data may or may not have a unique identifier such as a DOI. Without a unique identifier, it may be hard to determine in a single pass whether two records are duplicates. Hence, we consider various fields, either simultaneously or phase by phase, to handle duplicate records. The following rules are used in this work to remove duplicate records, based on the fields available after integrating data from multiple web archives:

a. Identifying duplicate records based on DOI.

b. Using the YEAR field (publication/accepted) to mark duplicate records.

c. Using a combination of fields, such as title of the paper + paging information, or ISSN + title of the article.

d. If the above rules do not resolve the duplicates, falling back on fields with lower accuracy, e.g., publisher name or author name.
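The phased rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (dictionaries with `doi`, `title`, `pages`, `issn`, `year`, and `author` keys), the `norm` helper, and the exact combination used for the low-accuracy fallback are all hypothetical.

```python
import re

def norm(value):
    """Lowercase and collapse punctuation so trivially different strings match."""
    return re.sub(r"[^a-z0-9]+", " ", str(value or "").lower()).strip()

def keys_for(rec):
    """Build one key per rule; a rule is skipped when a field it needs is missing."""
    keys = []
    if rec.get("doi"):                          # rule a: DOI
        keys.append(("doi", norm(rec["doi"])))
    if rec.get("title") and rec.get("pages"):   # rule c: title + paging information
        keys.append(("title+pages", norm(rec["title"]), norm(rec["pages"])))
    if rec.get("issn") and rec.get("title"):    # rule c: ISSN + title
        keys.append(("issn+title", norm(rec["issn"]), norm(rec["title"])))
    # rules b and d (assumed combination): year plus a lower-accuracy field
    if rec.get("title") and rec.get("year") and rec.get("author"):
        keys.append(("fallback", norm(rec["title"]), norm(rec["year"]), norm(rec["author"])))
    return keys

def dedupe(records):
    """Keep the first record seen for any key; later matches are dropped."""
    seen, unique = set(), []
    for rec in records:
        keys = keys_for(rec)
        if any(k in seen for k in keys):
            continue
        seen.update(keys)
        unique.append(rec)
    return unique

records = [
    {"doi": "10.1000/ABC", "title": "Data Imputation Methods", "pages": "1-10"},
    {"doi": "10.1000/abc", "title": "Data imputation methods"},   # same DOI
    {"title": "Data Imputation  Methods!", "pages": "1-10"},      # same title + pages
    {"title": "Handling Missing Values", "issn": "1234-5678", "year": 2020},
]
unique_records = dedupe(records)
```

Because a record is dropped as soon as any one of its keys has been seen, a record without a DOI can still be matched through a later rule.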

The Complexity of the Integration Operation

The simplest approach to handling duplicates in "R" records after integrating scraped data from multiple web archives is to compare record Ri with Ri+1 (i = 1, 2, 3, ..., n, where n is the total number of records). This approach involves R comparisons. Our work integrates the results of five web repositories for three different queries related to domain-specific publications (data imputation, handling missing values, and missing data handling), resulting in 5 * 3 = 15 different data sets. To handle duplicates in these scraped data, a multiple join procedure has to be carried out. For every query, if the web repositories return "R" records, the total number of comparisons needed to remove the duplicates depends on the rate of duplicates among the record lists returned by the web archives.

Overall, the worst-case time complexity to carry out de-duplication is "D * R", where "D" is constant when there are no duplicate records.

Considering the time complexity of this naive method, the proposed de-duplication method reduces the number of comparisons by creating blocks based on the YEAR of publication, so that only records within the same block need to be compared.
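The effect of blocking on the number of comparisons can be illustrated with a short sketch. It assumes each record is a dictionary with a `year` key and contrasts an exhaustive all-pairs comparison with pairs drawn only from within a year block; the function names are hypothetical, and placing records with a missing year into a block of their own is an assumption.

```python
from collections import defaultdict
from itertools import combinations

def all_pairs(records):
    """Every pair of record indices, as an exhaustive comparison would visit them."""
    return list(combinations(range(len(records)), 2))

def blocked_pairs(records, key="year"):
    """Only pairs of records that share the same block key (here, the year)."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec.get(key)].append(idx)   # a missing year lands in the None block
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs

sample = [{"year": y} for y in (2018, 2018, 2019, 2019, 2019, 2020)]
n_all = len(all_pairs(sample))          # 6 choose 2 = 15
n_blocked = len(blocked_pairs(sample))  # 1 + 3 + 0 = 4
```

The saving grows with the number of distinct years, since comparisons are confined to records published in the same year.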

The proposed method recommends a new document to the user by finding nearby candidates after embedding a seed document into the linear space. As a second step, the documents are re-ranked using a model that is able to distinguish between cited and uncited references. The entire process of encoding and embedding the textual content of each document into a linear space is done through a neural network. We formulate paper recommendation as a ranking problem. Given a seed document Sd and a large corpus of integrated papers, the task is to rank the documents so that the documents with a higher rank can be recommended to the user working in a particular domain. Crawling and scraping only the documents that are relevant to a specific domain is advantageous because the number of published papers in the scientific web archives can be very large, which makes it computationally expensive to rank every document as a candidate reference for the seed document Sd. Hence, we recommend citations on scraped and integrated domain-specific data.

Phase 1: Identifying Candidate Journals

In this phase, our goal is to recognize a set of candidate journals that are similar to the seed document, Sd. Using a trained neural network, we first embed all the curated documents into a linear space such that domain-centric documents tend to be closer to one another. Since the projection of a document is independent of the seed document, the entire set of curated documents needs to be projected only once and can be reused for subsequent queries. Then, we project each seed document Sd into the same linear space and identify its "k" nearest neighbors as candidate references. For the nearby candidates, both outgoing and incoming citations are considered (Strohman, Croft, and Jensen 2007). The outcome of this phase is a list of documents Di and their corresponding scores sim(Sd, Di), defined as the cosine similarity between Sd and Di in the embedding space. This phase yields a manageable number of candidate journals, making it practical to score each candidate Di by feeding the pair (Sd, Di) into another neural network in Phase 2, which is trained to discriminate whether the journal should be recommended or not. An overview of this phase is represented in Figure 5.3.
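A minimal sketch of this candidate-selection step is given below, assuming the document embeddings are already available as NumPy arrays. The function names and the `links` mapping (from a document index to the indices of documents it cites or is cited by) are hypothetical, and a brute-force similarity scan stands in for whatever nearest-neighbor index is actually used.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def select_candidates(seed_vec, corpus_vecs, links, k=4):
    """Return {document index: sim(Sd, Di)} for the k nearest neighbors
    of the seed plus the documents linked to them by citations."""
    corpus_unit = normalize_rows(np.asarray(corpus_vecs, dtype=float))
    seed_unit = normalize_rows(np.asarray(seed_vec, dtype=float)[None, :])[0]
    sims = corpus_unit @ seed_unit
    nearest = np.argsort(-sims)[:k]
    candidates = {int(i) for i in nearest}
    for i in nearest:
        candidates.update(links.get(int(i), ()))   # outgoing and incoming citations
    return {i: float(sims[i]) for i in sorted(candidates)}

corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
result = select_candidates([1.0, 0.0], corpus, links={1: [3]}, k=2)
```

In this toy example, documents 0 and 1 are the two nearest neighbors, and document 3 is added only because it is linked to document 1 by a citation.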

A supervised learning model is used to project the contents of document D to a dense embedding. Each textual field of D is represented as a bag-of-words, and feature vectors are computed as given in Equation 5.1 (Bhagavatula, Feldman, Power, and Ammar 2018).
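Assuming the formulation of the cited work (Bhagavatula et al. 2018) and the symbols used in this section, Equation 5.1 presumably has the following form, in which the feature vector of a field is the sum, over the word types t in that field, of a unit-length direction vector scaled by a scalar magnitude. The name FV for the feature vector is taken from its later use in Phase 2; the exact typesetting is an assumption.

```latex
FV_{D[\mathrm{field}]} \;=\; \sum_{t \,\in\, D[\mathrm{field}]} d_t^{\mathrm{mag}} \, \frac{d_t^{\mathrm{dir}}}{\lVert d_t^{\mathrm{dir}} \rVert_2} \qquad (5.1)
```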

FIGURE 5.3 An overview of Phase 1: All the papers in the corpus (P1 to P7 in this example) are projected into a linear space in addition to the seed document Sd. The nearest neighbors of Sd are chosen as the nearest candidates. With k = 4, the four nearest neighbors of Sd (among them P6 and P4) are selected as candidates. P7 is also considered a candidate because it is cited in one of the selected papers.

where d_t^{dir} is a dense direction embedding and d_t^{mag} is a scalar magnitude of word type t. The weighted average of fields is computed after normalizing the representation of each field to get the document embedding, De. The corresponding equation for normalizing is given in Equation 5.2. Here, we use the title, abstract, and keywords fields of a document D.
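Under the same assumptions as for Equation 5.1, Equation 5.2 can plausibly be written as a weighted sum of the normalized field vectors for the three fields named in the text. The FV notation is assumed here rather than confirmed by the source.

```latex
D_e \;=\; \lambda^{\mathrm{title}} \frac{FV_{D[\mathrm{title}]}}{\lVert FV_{D[\mathrm{title}]} \rVert_2}
\;+\; \lambda^{\mathrm{abstract}} \frac{FV_{D[\mathrm{abstract}]}}{\lVert FV_{D[\mathrm{abstract}]} \rVert_2}
\;+\; \lambda^{\mathrm{keywords}} \frac{FV_{D[\mathrm{keywords}]}}{\lVert FV_{D[\mathrm{keywords}]} \rVert_2} \qquad (5.2)
```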

where λ^{title}, λ^{abstract}, and λ^{keywords} are scalar model parameters.
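The computation of a document embedding from per-word parameters can be sketched as follows. This is an illustration under assumptions: documents are dictionaries of token lists, `direction` and `magnitude` are hypothetical lookup tables standing in for the learned parameters, and the numeric values are made up for the example.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length; a zero vector is returned unchanged."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def field_vector(tokens, direction, magnitude, dim):
    """Sum of magnitude-scaled unit direction vectors over the words of one field."""
    vec = np.zeros(dim)
    for t in tokens:
        if t in direction:                       # unknown words are skipped
            vec += magnitude[t] * unit(direction[t])
    return vec

def document_embedding(doc, direction, magnitude, lambdas, dim):
    """Weighted sum of the normalized field vectors (title, abstract, keywords)."""
    emb = np.zeros(dim)
    for field, weight in lambdas.items():
        emb += weight * unit(field_vector(doc.get(field, []), direction, magnitude, dim))
    return emb

direction = {"data": np.array([3.0, 4.0]), "missing": np.array([0.0, 2.0])}
magnitude = {"data": 2.0, "missing": 1.0}
lambdas = {"title": 1.0, "abstract": 0.5, "keywords": 0.25}
doc = {"title": ["data"], "abstract": ["missing"], "keywords": []}
embedding = document_embedding(doc, direction, magnitude, lambdas, dim=2)
```

Normalizing each field before weighting means a long abstract does not dominate a short title simply by containing more words.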

The parameters of the embedding model, that is, λ^{*}, d_t^{mag}, and d_t^{dir}, are learned using a training set T of triplets <Sd, D_cited, D_not-cited>, where Sd is a seed document, D_cited is a document cited in Sd, and D_not-cited is a document not cited in Sd. The goal of the model training is to predict a high cosine similarity for the pair (Sd, D_cited) and a low cosine similarity for the pair (Sd, D_not-cited) using the per-instance triplet loss (Wang et al. 2014). The loss function is given in Equation 5.3.
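Given the description above and the hyper-parameter α introduced just after it, the per-instance triplet loss of Equation 5.3 presumably takes the standard hinge form below. This is a reconstruction from the surrounding definitions, not a verbatim copy of the source equation.

```latex
\mathrm{loss} \;=\; \max\bigl(\alpha + s(S_d, D_{\text{not-cited}}) - s(S_d, D_{\text{cited}}),\; 0\bigr) \qquad (5.3)
```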

where s(Si, Dj) is defined as the cosine similarity between the document embeddings, cos-sim(Dei, Dej). Here, α is a hyper-parameter of the model and is tuned to get the best results. Choosing a positive sample for training is straightforward: any pair (Sd, D_cited) can be selected, where a document Sd in the training set cites D_cited. However, the not-cited sample must be selected carefully for the trained model to perform well. Here, the documents that are not cited in a seed document but that are near to it in the embedding space are chosen as negative samples to train the model.
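The loss and the choice of nearby negatives can be sketched as follows. The hinge form of the loss is an assumption based on the description of Equation 5.3, and `hard_negatives`, the embedding dictionary, and the value of `alpha` are hypothetical names and values used only for illustration.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def triplet_loss(seed, cited, not_cited, alpha=0.1):
    """Zero once the cited document beats the not-cited one by at least alpha."""
    return max(alpha + cos_sim(seed, not_cited) - cos_sim(seed, cited), 0.0)

def hard_negatives(seed_id, embeddings, cited_ids, n=1):
    """Not-cited documents that lie closest to the seed in the embedding space."""
    seed = embeddings[seed_id]
    scored = [(cos_sim(seed, vec), doc_id)
              for doc_id, vec in embeddings.items()
              if doc_id != seed_id and doc_id not in cited_ids]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:n]]

embeddings = {
    "seed": np.array([1.0, 0.0]),
    "cited": np.array([1.0, 0.1]),
    "near": np.array([0.9, 0.3]),
    "far": np.array([-1.0, 0.0]),
}
negatives = hard_negatives("seed", embeddings, cited_ids={"cited"})
easy_loss = triplet_loss(embeddings["seed"], embeddings["cited"], embeddings["far"])
hard_loss = triplet_loss(embeddings["seed"], embeddings["cited"], embeddings["near"])
```

A distant negative such as `far` already satisfies the margin and contributes no loss, which is why negatives close to the seed are the informative ones.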

Phase 2: Ranking Candidate Journals

This phase aims at training a model that takes a pair of documents (Sd, Di) and estimates the probability that Di should be recommended.

A vital goal of this work is to evaluate the viability of recommending journals without using metadata. A dense feature vector, FV_D[field], is calculated for each field (title, abstract, and keywords) of each document.

For the document under consideration, the word types that it shares with the seed document in the title, abstract, and keywords are identified by intersection, and the sum of their scalar weights is computed as an additional feature. Along with this, we also consider the log of the outward connections (the number of other documents cited) of Di, that is, log(Di[out-citations]). Lastly, the cosine similarity between Sd and Di in the embedding space, that is, cos-sim(Dei, Dej), is used. An overview of this phase is represented in Figure 5.4.
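The two numeric features described above can be sketched as follows. The function names and the weight table are hypothetical, and adding 1 before taking the logarithm is an assumption made here so that a document with no outgoing citations does not produce log(0).

```python
import math

def intersection_weight(seed_tokens, candidate_tokens, weights):
    """Sum of the scalar weights of word types shared by both documents."""
    shared = set(seed_tokens) & set(candidate_tokens)
    return sum(weights.get(t, 0.0) for t in shared)

def log_out_citations(n_out):
    """Log of the number of documents cited; the +1 is an assumption, not from the text."""
    return math.log(n_out + 1)

weights = {"data": 0.2, "imputation": 1.5, "survey": 0.7}
title_overlap = intersection_weight(
    ["missing", "data", "imputation"],
    ["data", "imputation", "survey"],
    weights,
)
```

Here only "data" and "imputation" are shared, so the feature is the sum of their two weights.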

FIGURE 5.4 An overview of Phase 2: each pair of the seed document Sd and a candidate paper from Phase 1, such as (Sd, P2), (Sd, P6), (Sd, P4), and (Sd, P7), is scored separately to re-rank the papers; the top three papers, which include P4 and P7, are recommended.

Architecture: In our work, a supervised feed-forward neural network is used to rank the documents. The network has three layers: two exponential linear unit (ELU) layers and one sigmoid layer. The model is shown in Figure 5.5. First, the cosine similarity between the embeddings of Sd and Di is computed for the textual and categorical fields. Then, the cosine similarity scores are concatenated with the numeric features and the weighted sum of intersection words, and the result is passed through two dense layers with the exponential linear unit, a non-linear activation function. The final layer is an output layer with a non-linear sigmoid function, which estimates the probability that Di should be recommended. The output layer is defined in Equation 5.4.
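Based on the features listed in this section, Equation 5.4 plausibly has the following shape, in which h concatenates the per-field cosine similarities, the embedding-space similarity, the intersection weight sums, and the log of the outgoing citations. The symbol g for the per-field similarity and the ordering of the entries in h are assumptions, not taken from the source.

```latex
s(S_d, D_i) \;=\; \mathrm{FeedForward}(h), \qquad
h \;=\; \Bigl[\, g_{\mathrm{title}};\; g_{\mathrm{abstract}};\; g_{\mathrm{keywords}};\;
\text{cos-sim}(D_{e_d}, D_{e_i});\;
\sum_{t \in \cap_{\mathrm{title}}} d_t^{\cap};\;
\sum_{t \in \cap_{\mathrm{abstract}}} d_t^{\cap};\;
\sum_{t \in \cap_{\mathrm{keywords}}} d_t^{\cap};\;
\log\bigl(D_i[\text{out-citations}]\bigr) \Bigr] \qquad (5.4)
```

where g_field = cos-sim(FV_{S_d[field]}, FV_{D_i[field]}).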

FIGURE 5.5 Architecture of a neural network.

Training the model: The parameters used in Phase 2 are d_t^{mag}, d_t^{dir}, d_t^{∩}, and the parameters of the three dense layers in the supervised neural network. These parameters are trained with the triplet loss function defined in Equation 5.3, with the similarity function s(Sd, Di) redefined as the neural network output given in Equation 5.4. During testing, and when serving a real-time query, we used the proposed model to recommend the journals Di with the highest s(Sd, Di) scores.
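The forward pass of such a ranking network, and the final selection of the highest-scoring candidates, can be sketched with NumPy as follows. The layer sizes, the parameter names (`W1`, `b1`, `W2`, `b2`, `w_out`, `b_out`), and the candidate labels are hypothetical; training with the triplet loss is not shown.

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rank_score(features, params):
    """Two ELU dense layers followed by a sigmoid output, giving s(Sd, Di) in (0, 1)."""
    h = elu(params["W1"] @ features + params["b1"])
    h = elu(params["W2"] @ h + params["b2"])
    return float(sigmoid(params["w_out"] @ h + params["b_out"]))

def recommend(candidates, params, top_n=3):
    """Score every candidate feature vector and return the top_n labels."""
    scored = sorted(((rank_score(f, params), name) for name, f in candidates.items()),
                    reverse=True)
    return [name for _, name in scored[:top_n]]

params = {
    "W1": np.eye(2), "b1": np.zeros(2),
    "W2": np.eye(2), "b2": np.zeros(2),
    "w_out": np.ones(2), "b_out": 0.0,
}
candidates = {
    "cand_a": np.array([1.0, 2.0]),
    "cand_b": np.array([-1.0, -1.0]),
    "cand_c": np.array([0.0, 0.0]),
}
top_two = recommend(candidates, params, top_n=2)
```

With these identity weights the score depends only on the feature values, so the candidate with the largest positive features ranks first.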

  • [1] Handling irrelevant fields: Several fields in the scraped raw data sets are not needed for recommendation purposes, such as GSrank, CitesPerYear, and Age of the paper. These fields are removed.
  • [2] Handling duplicate fields: During the scraping process, fields with different names but the same values are collected. Since each web repository maintains its own style of metadata about scientific papers, the chances of having the same values under different field names are high. When the scraped data from different scientific web repositories are integrated, the presence of duplicate fields may contribute to high-dimensional data. For example, fields like Estimated Citation Count (ECC) and Cites hold the same values, forming duplicates. Hence, these duplicate fields are removed.
  • [3] Records with missing values: There are several instances in the scraped data of missing values in different fields of articles, such as URL, DOI, and ISSN. These missing values can be handled through human intervention, by searching for them manually.
  • [4] Handling duplicate records: As the developed crawler and scraper are domain-specific, they scrape data based on the domain name, for example, data imputation, handling missing values, or missing data handling.