Domain-Specific Journal Recommendation Using a Feed Forward Neural Network
- Literature Survey
- Content-Based Recommendation System for Domain-Specific Papers
- Scraping and Data Integration (Challenges and Solutions for Data Collection)
- Limitations on the Size of the Query Results
- Fixed Limits
- Dynamic Contents
- Access Limitations
- Masked URL Parameters
- Robot Recognition and Reverse Turing Tests
- Changing the Content of Request Headers
- Selecting Appropriate Cookie Settings
- Requests and Different Time Intervals
- Altering the IP Address
S. Nickolas and K. Shobha
Information Technology (IT) has laid a strong foundation for electronic literature, which has produced vast amounts of data (Zhao, Wu, and Liu 2016) and, in turn, information overload problems (Drachsler, Hummel, and Koper 2008). With this growth in data and the growing competition in the research environment, scientific web archives have become increasingly significant to different users (Meschenmoser, Meuschke, Hotz, and Gipp 2016). However, because of the sheer volume of data in these archives, searching for the most relevant research article has become challenging and time-consuming, despite significant advances in digital libraries and information retrieval systems.
Research article recommendation demands a method that suggests significant articles to researchers by discovering their interests and preferences (Basu, Hirsh, Cohen, and Nevill-Manning 2001). By dynamically suggesting relevant materials, research article recommendation can save researchers much time and effort (Pan and Li 2010).
Researchers are highly motivated to make their work accessible in order to disseminate it, to increase their reputation, and to progress in their profession. Research and funding organizations often rely on scientific web archives, for example, to determine a researcher’s number of published papers and citation counts, and to make appointment, promotion, and funding decisions.
Domain-centric research article recommendations can help improve the quality and efficiency of the recommendation process by suggesting published articles similar to researchers’ interests. To this end, we methodically queried the web archives, scraped the returned links, and inspected the features that contribute to the recommendation of publications; that is, the scraped data are curated to retain the best features (Das, Naik, Behera, Jaiswal, Mahato, and Rout 2020; Das, Naik, and Behera 2020b; Das, Naik, and Behera 2018).
Researchers have proposed numerous feature extraction algorithms for selecting the best features to perform data analytics (Das, Dey, and Balas 2019; Dey, Das, Naik, and Behera 2019) and various machine learning tasks like classification (Das, Naik, and Behera 2020c), clustering (Das, Naik, and Behera 2020a; Das, Jena, Nayak, Naik, and Behera 2015), forecasting (Rout, Jena, Rout, and Das 2020), and decision making (Dey, Ashour, Kalia, Goswami, and Das 2018).
Motivated by this need, we use Google Scholar, Semantic Scholar, Scopus, Web of Science, and Microsoft Academic data to provide customizable recommendations for individuals. The aim is to support the research community by recommending the most suitable articles from a curated list of publications in their domain of interest. To this end, we developed a method to assess the publication output of a set of researchers who work in the same research domain. The method comprises a scraper to acquire the required data. Figure 5.1 provides an overview of the tool.
FIGURE 5.1 Outline of the data scraping.
Our proposed method uses a neural network to embed curated documents into a vector space by encoding the textual content of each document. We then select the nearest neighbors of a seed document as candidates and re-rank them using a second model trained to differentiate between cited and non-cited documents. Unlike existing works, our proposed model embeds newly published domain-centric documents in the same vector space used to identify journals similar to candidate journals based on their text content, obviating the need to retrain the models to include newly published journals.
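The embed, select, and re-rank steps described above can be illustrated with a minimal sketch. It rests on loud assumptions: `embed` is a simple bag-of-words stand-in for the neural text encoder, the corpus is invented, and `second_model_scores` holds hypothetical values in place of the trained second model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in encoder: unit-length bag-of-words vector (token -> weight).

    Assumption: this replaces the neural text encoder, which is not shown here.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def nearest_neighbors(seed_vec, index, k):
    """Return the ids of the k documents closest to the seed vector."""
    ranked = sorted(index, key=lambda doc_id: cosine(seed_vec, index[doc_id]),
                    reverse=True)
    return ranked[:k]


def rerank(candidates, score):
    """Re-order candidates by a second model's score, highest first."""
    return sorted(candidates, key=score, reverse=True)


# Invented toy corpus of curated documents.
corpus = {
    "p1": "neural network recommendation of research papers",
    "p2": "web scraping of scientific archives",
    "p3": "citation recommendation with neural document embeddings",
}
index = {doc_id: embed(text) for doc_id, text in corpus.items()}

# A newly published document is embedded with the same function,
# so nothing has to be retrained to include it.
index["p4"] = embed("soil chemistry of alpine meadows")

seed = embed("neural recommendation of citations for research papers")
candidates = nearest_neighbors(seed, index, k=2)

# Hypothetical scores standing in for the trained second model.
second_model_scores = {"p1": 0.2, "p3": 0.9}
recommended = rerank(candidates, second_model_scores.get)
```

Because new documents only need to pass through `embed`, adding `p4` changes the index but not the encoder, which is the property the paragraph above relies on.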
The rest of this chapter presents the main challenges of scraping data from scientific web archives and plausible solutions to overcome them. We also discuss the pre-processing techniques used to handle the scraped data and to build a recommender system that suggests suitable papers.
Literature Survey
Scientific paper recommendation is the task of finding and recommending the most relevant papers from a large pool, given a domain of interest (Zhao, Wu, and Liu 2016). The ability to automatically filter a broad set of documents and find those most closely associated with one’s research interest has clear benefits. With the increasing number of publications, many of them in web archives, it is challenging to keep track of the latest research, even within one’s own area of interest. As the timeliness of data becomes ever more critical, a paper must also reach researchers with minimal delay. In this work, we consider algorithms for curating integrated data based on rules and for recommending a focused set of scientific articles from a particular domain.
The most common recommendation methods employed in applications such as e-commerce, tourism, and entertainment websites are content-based and collaborative filtering. Existing work shows that these filtering techniques are also used for recommending research articles. Content-based filtering constructs a user profile from the user’s reading pattern and recommends articles that best match it. In existing methods, the user profile is generally built by weighting keywords; however, keywords alone are inadequate to model an individual’s preferences. To enrich the preference semantics, numerous methods have been proposed, such as the label-enriched approach (Guan, Wang, Bu, Chen, Yang, Cai, and He 2010) and the ontology-expansion approach (Zhang, Ni, Zhao, Liu, and Yang 2014). In collaborative filtering, like-minded academic groups are identified first, and recommendations are generated based on their shared interests. The key concern in collaborative filtering is how to compute the similarity between individuals. Davoodi et al. and Drachsler et al. used data from resources such as e-mail logs, co-authorship, references, and social media to analyze and find the similarity among individuals (Davoodi, Afsharchi, and Kianmehr 2012; Drachsler, Hummel, and Koper 2008).
Many researchers have considered elements such as domain knowledge, user background, learning targets, and cognitive patterns, in addition to user preference, when recommending resources (Zhao, Wu, and Liu 2016). Zhang et al. (2014) analyzed association rules between resources and courses and made recommendations for teaching resources. Tang and McCalla proposed a recommendation technique that focuses on teaching features and combines the user’s knowledge level and knowledge goals (Tang and McCalla 2004).
Some researchers have modeled domain knowledge as domain taxonomies, ontologies, and concept networks. Liang et al. built a semantic network based on visited documents, where the links in each semantic tree represent the inheritance relationships between concepts (Liang, Yang, Chen, and Ku 2008). Cantador and Castells proposed a cluster-based paper recommendation method in which each cluster in a semantic network represents users with similar preferences (Cantador and Castells 2006). De Gemmis et al. developed an adaptive ontology based on the user’s reading behavior; the user’s actions were retrieved from the ontology, and recommendations were made based on the similar patterns observed (De Gemmis, Lops, Semeraro, and Musto 2015).
Meschenmoser et al. define web scraping as an automated technique to extract and retrieve targeted web data at scale (Meschenmoser, Meuschke, Hotz, and Gipp 2016). A variety of tools and interfaces for building personalized scrapers exist, as do customizable, well-equipped scraping frameworks. Glez-Peña et al. and Haddaway present ample summaries of frameworks and tools for various extraction tasks, namely DataToolBar, Helium Scraper, Screen Scraper, and FMiner (Glez-Peña, Lourenço, López-Fernández, Reboiro-Jato, and Fdez-Riverola 2013; Haddaway 2015). However, there are very few scrapers for mining scientific records and bibliographic data. Smith-Unna and Murray-Rust recommend the ContentMine framework, which allows building personalized tools and other data mining elements (Smith-Unna and Murray-Rust 2014). Tang et al. propose the AMiner framework, which collects and integrates heterogeneous social network data for researchers from many web data sources (Tang, Zhang, Yao, Li, Zhang, and Su 2008). However, the framework makes no provision for personalized content mining.
Content-Based Recommendation System for Domain-Specific Papers
In this chapter, we use a content-based recommendation system to recommend research articles to researchers working in a particular domain. The approach is inspired by the content-based citation recommendation method of Bhagavatula et al. (2018), which recommends citations for an academic paper draft. The workflow of the proposed compendium network is presented in Figure 5.2, and each step is discussed in detail.
Scraping and Data Integration (Challenges and Solutions for Data Collection)
Commonly encountered obstacles when collecting data from scientific web archives through scraping can be categorized as follows (Meschenmoser, Meuschke, Hotz, and Gipp 2016):
1. Limitations on the size of the query results;
2. Dynamic contents;
3. Access limitations.
FIGURE 5.2 Workflow of DICN.
The following subsections describe these obstacles and suggest approaches to overcome them.
Limitations on the Size of the Query Results
Fixed Limits
For any query, scientific web archives often return only a fixed number of results. Such a fixed limit is adequate for interactive use, since a ranked retrieval system usually returns the relevant results within the top ten ranks, and a researcher will likely refine the search query if the desired information is not retrieved. For a scraper that needs the complete result set, however, the limit is an obstacle.
Before developing any scraper code, one has to inspect all the options a web archive offers for retrieving content. This includes identifying upper limits on result counts and examining all URL parameters. The imposed restrictions may vary between item types; hence, it is advisable to investigate several item types.
For example, the result limit for author searches could be lower than that for scientific publications. Hence, the developer needs to determine the best settings for scraping a particular repository.
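As a small illustration of examining URL parameters, the sketch below splits a search URL into its query parameters with the standard library. The URL and its parameter names (`q`, `type`, `start`, `limit`) are hypothetical; real archives use their own names.

```python
from urllib.parse import parse_qs, urlsplit


def query_parameters(url):
    """Return the query parameters of a URL as a dict of name -> list of values."""
    return parse_qs(urlsplit(url).query)


# Hypothetical search URL copied from a browser's address bar.
url = ("https://archive.example.org/search"
       "?q=neural+networks&type=publication&start=0&limit=20")
params = query_parameters(url)

# Parameters that look related to result size are the ones worth probing
# for upper limits (the name list is an assumption, not a standard).
size_related = {name: values for name, values in params.items()
                if name in {"start", "limit", "page", "size", "offset"}}
```

Repeating the inspection for each item type (for example, authors and publications) shows whether the limits differ between them.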
We have developed different scrapers for data collection and pipeline the output of one scraper as the input to another for efficient data collection. For example, one could scrape the URLs of the most cited and most recent publications in a particular domain, use these URLs to download the PDFs, and then use the result sets to extract the authors’ and co-authors’ details.
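One way to chain scrapers like this is with generator stages, sketched below. This is not the chapter's implementation: the stage names are hypothetical, and the search, download, and author-parsing functions are in-memory stand-ins so that the example runs without network access.

```python
def publication_urls(domain, search):
    """Stage 1: yield the PDF URL of each search hit for a domain."""
    for hit in search(domain):
        yield hit["pdf_url"]


def downloaded(urls, download):
    """Stage 2: yield (url, raw text) for each URL."""
    for url in urls:
        yield url, download(url)


def author_records(documents, parse_authors):
    """Stage 3: yield one record per document with its author list."""
    for url, text in documents:
        yield {"url": url, "authors": parse_authors(text)}


# In-memory stand-ins (assumptions) replacing real scrapers and PDF parsing.
FAKE_HITS = {
    "recommender systems": [
        {"pdf_url": "https://archive.example.org/a.pdf"},
        {"pdf_url": "https://archive.example.org/b.pdf"},
    ]
}
FAKE_FILES = {
    "https://archive.example.org/a.pdf": "Authors: First Author; Second Author\nBody",
    "https://archive.example.org/b.pdf": "Authors: Third Author\nBody",
}


def fake_search(domain):
    return FAKE_HITS.get(domain, [])


def fake_download(url):
    return FAKE_FILES[url]


def fake_parse_authors(text):
    first_line = text.splitlines()[0]
    names = first_line.split(":", 1)[1]
    return [name.strip() for name in names.split(";")]


urls = publication_urls("recommender systems", fake_search)
documents = downloaded(urls, fake_download)
records = list(author_records(documents, fake_parse_authors))
```

Each stage consumes the previous stage lazily, so a document is downloaded and parsed as soon as its URL is found rather than after the whole URL list is complete.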
Apart from fixed limits on result sets, web archives usually use pagination to list the contents of a query response. In a standard pagination technique, the interface divides the query response into many pages, each displaying a preset number of items.
The form of the pagination parameters may significantly affect a scraper’s effectiveness. Some web archives use pagination parameters that allow web crawlers to access their data efficiently. For example, if the result pages are identified by consecutive page numbers, a web crawler can effortlessly retrieve the result set. If the item sub-lists are defined by other keywords such as “start,” “limit,” and “end,” web crawlers can even retrieve items at arbitrary positions with a single query.
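The two pagination styles above can be sketched as URL builders. The base URL and the parameter names `q`, `page`, `start`, and `limit` are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def numbered_page_urls(base_url, query, last_page):
    """URLs for pages identified by consecutive numbers, 1..last_page."""
    return [f"{base_url}?{urlencode({'q': query, 'page': n})}"
            for n in range(1, last_page + 1)]


def window_url(base_url, query, start, limit):
    """URL for an arbitrary sub-list selected by start position and size."""
    return f"{base_url}?{urlencode({'q': query, 'start': start, 'limit': limit})}"


BASE = "https://archive.example.org/search"  # hypothetical endpoint
pages = numbered_page_urls(BASE, "neural networks", last_page=3)
jump = window_url(BASE, "neural networks", start=500, limit=100)
```

With numbered pages every URL is known in advance, so requests can be issued in any order; with a start/limit window, a single request reaches items 500 to 599 directly.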
Dynamic Contents
Automated browsing with tools and frameworks that simulate a regular page visit is useful but not very efficient for dynamic web content. For example, Selenium, a framework with WebDriver packages, enables browser automation for all modern browsers from different programming languages. The drawback is that automated page visits reduce a crawler’s performance and might not behave consistently in every set-up. Another approach to retrieving dynamic content requires the programmer to know in advance how data is added to the current page. There exist two methods to add data to a Hypertext Markup Language (HTML) page.
A researcher can inspect the network logs and identify the URLs and parameters used in the POST and GET requests that load dynamic content. Depending on the server’s configuration, the web crawler can then call the identified URLs directly and process the responses.
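Once such an endpoint has been identified in the network log, a crawler can build the same request itself. The sketch below only constructs the requests and parses a sample response; the endpoint, the parameter names, and the JSON layout (`items`, `title`) are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_get(endpoint, params):
    """GET request with the parameters observed in the network log."""
    return Request(f"{endpoint}?{urlencode(params)}",
                   headers={"Accept": "application/json"}, method="GET")


def build_post(endpoint, form):
    """POST request carrying the observed form fields, URL-encoded."""
    return Request(endpoint, data=urlencode(form).encode("utf-8"), method="POST")


def parse_titles(body):
    """Extract titles from a JSON body shaped like {"items": [{"title": ...}]}."""
    return [item["title"] for item in json.loads(body)["items"]]


ENDPOINT = "https://archive.example.org/api/results"  # hypothetical
get_request = build_get(ENDPOINT, {"q": "scraping", "offset": 20})
post_request = build_post(ENDPOINT, {"q": "scraping", "offset": 20})

# Sample payload standing in for a server response.
sample_body = b'{"items": [{"title": "Paper A"}, {"title": "Paper B"}]}'
titles = parse_titles(sample_body)
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs offline.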
Access Limitations
Masked URL Parameters
Web repositories typically employ non-sequential URL parameters to obstruct content mining, for example, page identifiers made of randomly selected characters. A common scheme uses identifiers of 12 characters drawn from lower- and upper-case letters, digits, the dash, and the underscore, which would result in approximately 3.9 × 10^20 possible pages. In scientific web archives, a considerable share of these identifiers do not lead to actual pages. Hence, crawling such repositories requires other ways to acquire the identifiers. Another method web archives use to obstruct access to subsequent pages in a paginated listing is to make the URL of the succeeding page depend on the content of the current page. Scientific web archives use an “after item” parameter that typically holds the identifier of a data element on the current page and that has to be specified to access the succeeding page. In this case, the efficiency of the scraper decreases significantly, as each page must be parsed to determine the next one. Directly accessing arbitrary sub-lists with a single request becomes hard in these cases.
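The sequential dependency created by an “after item” parameter can be sketched as follows. The page-fetching function is an in-memory stand-in, and the identifiers and page size are invented for illustration.

```python
def crawl_with_cursor(fetch_page, max_pages=100):
    """Follow an 'after item' cursor: each request needs the last id of the previous page."""
    items, cursor = [], None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        if not page:
            break
        items.extend(page)
        cursor = page[-1]["id"]  # must parse this page before requesting the next
    return items


# In-memory stand-in for an archive that pages by "after item".
ITEMS = [{"id": "k3Xa"}, {"id": "_9Qz"}, {"id": "Bb-7"}, {"id": "m0Pq"}, {"id": "Zz1_"}]
PAGE_SIZE = 2
requests_made = []


def fake_fetch_page(after):
    requests_made.append(after)
    ids = [item["id"] for item in ITEMS]
    start = 0 if after is None else ids.index(after) + 1
    return ITEMS[start:start + PAGE_SIZE]


collected = crawl_with_cursor(fake_fetch_page)
```

Every request after the first carries an identifier that is only known once the preceding page has been parsed, so the pages cannot be fetched in parallel or out of order.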
Robot Recognition and Reverse Turing Tests
Numerous web archives employ various techniques to identify and block robots. Much research effort has been devoted to developing such identification techniques, which usually depend on machine learning algorithms. Balia et al. propose a decision tree based on ID3 (Balia, Stassopoulou, and Dikaiakos 2011), Bomhardt et al. recommend neural networks (Bomhardt, Gaul, and Schmidt-Thieme 2005), and Lu and Yu employ Hidden Markov Models for this purpose (Lu and Yu 2006).
If the identification approaches suspect an automated access attempt, they may trigger reverse Turing tests, which require specific user interactions to restore access to the web resource, for example, static text CAPTCHAs, audio or video CAPTCHAs, image classification, and Google’s “reCAPTCHA.” To avoid being categorized as a robot and denied access, a scraper can adopt the approaches listed below.
Changing the Content of Request Headers
A scraper could vary the User-Agent request header, presenting a different user agent’s identity for different requests.
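A minimal sketch of this idea cycles through a list of header values while building requests. The user agent strings below are illustrative placeholders, not real browser identifiers, and no request is actually sent.

```python
import itertools
from urllib.request import Request

# Placeholder strings (assumptions); a real scraper would supply its own list.
USER_AGENTS = [
    "ExampleBrowser/1.0 (illustrative)",
    "ExampleBrowser/2.0 (illustrative)",
    "OtherBrowser/5.1 (illustrative)",
]


def requests_with_rotating_agents(urls, agents=USER_AGENTS):
    """Build one request per URL, cycling through the user agent list."""
    pool = itertools.cycle(agents)
    return [Request(url, headers={"User-Agent": next(pool)}) for url in urls]


urls = [f"https://archive.example.org/item/{n}" for n in range(4)]
prepared = requests_with_rotating_agents(urls)

# urllib stores header names capitalized, hence the "User-agent" lookup key.
agents_used = [request.get_header("User-agent") for request in prepared]
```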
Selecting Appropriate Cookie Settings
Depending on the scraping scenario, the choice of whether to accept or reject cookies requires a thorough study of the target site’s policy, because enabling cookies can help a scraper seem more human.
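With Python's standard library, the choice can be expressed as whether a cookie processor is attached to the opener. This is a sketch of the switch only; the function name `make_opener` is hypothetical, and which setting suits a given archive remains a judgment call.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_opener(accept_cookies):
    """Return (opener, jar). Without a cookie processor, cookies are not stored or sent back."""
    if accept_cookies:
        jar = CookieJar()
        return build_opener(HTTPCookieProcessor(jar)), jar
    return build_opener(), None


def handles_cookies(opener):
    """True if the opener has a cookie processor among its handlers."""
    return any(isinstance(handler, HTTPCookieProcessor) for handler in opener.handlers)


cookie_opener, jar = make_opener(accept_cookies=True)
plain_opener, no_jar = make_opener(accept_cookies=False)
```

Requests made through `cookie_opener.open(...)` would store cookies in `jar` and return them on later requests, while `plain_opener` would not keep any.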
Requests and Different Time Intervals
The interval between successive requests can be varied by choosing random time intervals or by deriving the intervals from statistical data on user interactions. However, developers have to balance the trade-off between delaying requests and the scraper’s efficiency.
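The random-interval variant can be sketched as a fetch loop that sleeps for a uniformly drawn delay between requests. The delay bounds are assumptions, and the `fetch` and `sleep` functions are injected so the example runs instantly and offline.

```python
import random
import time


def fetch_with_random_delays(urls, fetch, min_delay=1.0, max_delay=4.0,
                             sleep=time.sleep, rng=random):
    """Fetch each URL, pausing for a random interval between consecutive requests."""
    results = []
    for position, url in enumerate(urls):
        if position:  # no pause before the first request
            sleep(rng.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demonstration with stand-ins: delays are recorded instead of slept.
recorded_delays = []
results = fetch_with_random_delays(
    ["u1", "u2", "u3"],
    fetch=str.upper,               # stand-in for a real download function
    sleep=recorded_delays.append,  # stand-in for time.sleep
    rng=random.Random(0),
)
```

The total added waiting time is the sum of the drawn delays, which makes the trade-off with scraper efficiency explicit: wider bounds look less mechanical but lengthen the run.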
Altering the IP Address
By using services such as “The Onion Router,” queries can be sent from different IP addresses. The IP address can be changed at fixed or at randomly selected time intervals, or when the queried server returns errors, which generally indicates that the scraper was denied access.
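The rotation policy described above (switch after a fixed number of requests, or when the server signals denial) can be sketched independently of any particular service. The proxy addresses are hypothetical, treating HTTP status codes 403 and 429 as signs of denied access is an assumption, and routing through The Onion Router would additionally need a SOCKS-capable client, which is not shown.

```python
from urllib.request import ProxyHandler, build_opener


class ProxyRotator:
    """Switch to the next proxy after N requests or after a denial status."""

    def __init__(self, proxies, rotate_every=None, error_statuses=(403, 429)):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self.proxies = list(proxies)
        self.rotate_every = rotate_every
        self.error_statuses = set(error_statuses)
        self._index = 0
        self._since_rotation = 0

    @property
    def current(self):
        return self.proxies[self._index]

    def rotate(self):
        self._index = (self._index + 1) % len(self.proxies)
        self._since_rotation = 0

    def record(self, status):
        """Register a response status and rotate if the policy says so."""
        self._since_rotation += 1
        denied = status in self.error_statuses
        due = (self.rotate_every is not None
               and self._since_rotation >= self.rotate_every)
        if denied or due:
            self.rotate()

    def opener(self):
        """Opener that sends requests through the current proxy."""
        proxy = self.current
        return build_opener(ProxyHandler({"http": proxy, "https": proxy}))


# Hypothetical proxy addresses.
rotator = ProxyRotator(
    ["http://proxy-a.example:8080", "http://proxy-b.example:8080",
     "http://proxy-c.example:8080"],
    rotate_every=2,
)
history = [rotator.current]
for status in (200, 200, 403, 200):
    rotator.record(status)
    history.append(rotator.current)
```

Replacing the fixed `rotate_every` with a randomly drawn count after each rotation would give the arbitrarily selected intervals mentioned in the text.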
In our scraper, we alter the user agent settings, randomize the entities to scrape, restrict cookies, and alter IP addresses. We do not vary the time intervals between queries, since the scraper’s current behavior is acceptable for our use and delaying queries would increase the processing time.