Document clustering is a machine learning technique that groups documents into clusters based on their similarity. For text mining, clustering is used for various functions such as document selection, organization, summarization, and visualization.

There are multiple approaches to clustering, and a wide variety of algorithms exist for completing this task.

Clustering algorithms are typically unsupervised (refer to our definition of unsupervised learning in Chapter 1).

The most well-known clustering algorithm is the “k-means” algorithm. In this algorithm, each cluster is represented by the mean (centroid) of the data points assigned to it, and each data point (here, a document) is assigned to the cluster whose mean is closest under the chosen similarity or distance measure. Similar clustering techniques use other measures of central tendency, such as the median or the mode. Other clustering methodologies include density-based clustering and hierarchical clustering.
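The assign-then-average loop behind k-means can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the term-count vectors, the Euclidean distance, the hand-picked initial centroids, and all function names (`vectorize`, `kmeans`, and so on) are choices made for this example, not something prescribed by the text.

```python
import math
from collections import Counter

def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, initial_centroids, max_iter=100):
    """Alternate between assigning points and recomputing cluster means."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its closest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: euclidean(v, centroids[j]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # no change since the last pass, so the clustering is stable
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignments) if a == j]
            if members:
                centroids[j] = mean_vector(members)
    return assignments, centroids

# Hypothetical toy corpus: two finance documents, two sports documents.
docs = [
    "stock market prices rise",
    "market prices fall on stock news",
    "team wins football match",
    "football team loses match",
]
vectors = vectorize(docs)
# Assumption: centroids are seeded with documents 0 and 2 for reproducibility.
labels, _ = kmeans(vectors, initial_centroids=[vectors[0], vectors[2]])
print(labels)  # [0, 0, 1, 1]
```

In practice the initial centroids are usually chosen at random, so different runs can produce different clusterings.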


Document classification is a machine learning technique that assigns predefined classes to documents.

In contrast to clustering algorithms, classification algorithms are typically supervised. Researchers provide the algorithm with training examples that include the correct class (also called classification or category) and the features used to represent each document (such as the vector-based representation). The classification algorithm then constructs a model that best maps the given features to each class.

When the training data have only two classes, a binary classifier is constructed. Where there are more than two classes, a multi-class classifier is required (Blake, 2011).

Examples of classification algorithms include:

  • k-nearest neighbors (based on proximity to the k closest training examples),
  • Naive Bayes (probabilistic model based on Bayes’ theorem of conditional probability),
  • Support vector machines (models where training data is represented as points in space that are separated into categories by a boundary),
  • Decision trees (based on a top-down tree structure with “if-then” rules learned from the training data) and other decision rules-based models, and
  • Neural networks (models loosely inspired by the way the human brain processes information).
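As an illustration of the first algorithm in the list, the following sketch classifies a document by majority vote among its k most similar training examples. The bag-of-words representation, the cosine similarity measure, the toy spam/ham training set, and the function names are all assumptions made for this example.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a document as a word-count mapping."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_classify(text, training, k=3):
    """Return the majority class among the k most similar training examples."""
    query = bag_of_words(text)
    ranked = sorted(
        training,
        key=lambda ex: cosine_similarity(query, bag_of_words(ex[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training examples (text, class).
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status report attached", "ham"),
    ("lunch on monday with the team", "ham"),
]

prediction = knn_classify("free prize waiting click now", training, k=3)
print(prediction)  # spam
```

Because there are only two classes here, this is a binary classifier; the same function handles more classes without change, since the vote simply counts whatever labels appear among the neighbors.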

There are several applications of document classification, such as spam filtering, email routing, content tagging (which improves browsing and accelerates searches in extensive unstructured text collections), and customer opinion and sentiment analysis.

Entity and Relation Extraction

Entity extraction algorithms are used to extract entities such as person names, organization names, locations, dates, phone numbers, reference numbers, prices, amounts, and other items from documents.
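For entities with a regular surface form, such as dates, phone numbers, and prices, a very simple extractor can be sketched with regular expressions. The patterns below are assumptions chosen for this example (ISO-style dates, one US-style phone format, dollar amounts) and would miss many real-world variants; names and locations generally need the learning-based approaches described later in this section.

```python
import re

# Hypothetical patterns; each covers only one common surface form.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "PRICE": re.compile(r"\$\d+(?:\.\d{2})?"),
}

def extract_entities(text):
    """Return (label, value) pairs in the order they appear in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), label, m.group()))
    # Sort by start offset so the output follows document order.
    return [(label, value) for _, label, value in sorted(found)]

text = "Invoice dated 2023-04-15: total $249.99. Questions? Call 555-123-4567."
entities = extract_entities(text)
print(entities)
# [('DATE', '2023-04-15'), ('PRICE', '$249.99'), ('PHONE', '555-123-4567')]
```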

Relation extraction algorithms are used to identify and characterize relations between entities such as person-organization (e.g., an employee of), person-location (e.g., born in), or organization-location (e.g., headquartered in). Some algorithms focus on event extraction, which is aimed at identifying entities that are related to an event.
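A minimal pattern-based sketch shows what the output of relation extraction looks like: (entity, relation, entity) triples. The trigger phrases, the crude capitalized-word pattern used to spot entity names, and the example sentences (about a fictional person and company) are all assumptions for illustration; practical systems learn such patterns from data rather than hard-coding them.

```python
import re

# Assumption: an entity name is a run of capitalized words.
NAME = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"

# Hypothetical trigger phrases, one per relation type named in the text.
RELATION_PATTERNS = [
    ("employee_of", re.compile(rf"({NAME}) works for ({NAME})")),
    ("born_in", re.compile(rf"({NAME}) was born in ({NAME})")),
    ("headquartered_in", re.compile(rf"({NAME}) is headquartered in ({NAME})")),
]

def extract_relations(text):
    """Return (subject, relation, object) triples found by the patterns."""
    triples = []
    for relation, pattern in RELATION_PATTERNS:
        for m in pattern.finditer(text):
            triples.append((m.group(1), relation, m.group(2)))
    return triples

text = (
    "Jane Doe works for Acme Corp. "
    "Jane Doe was born in Springfield. "
    "Acme Corp is headquartered in New York."
)
relations = extract_relations(text)
for triple in relations:
    print(triple)
```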

Information extraction (IE) algorithms use various machine learning approaches, including rule learning-based methods, classification-based methods, and sequential labeling-based methods (Tang et al., 2008).

  • Rule learning-based systems use predefined instructions on extracting the desired information (i.e., words or text fragments) from the text. They include:
    o Dictionary-based systems: these systems first construct a pattern (template) dictionary, and then use the dictionary to extract the needed information from text.
    o Rule-based systems: these systems use general rules instead of a dictionary to extract information from text.
    o Wrapper systems: a wrapper is an extraction procedure consisting of a set of extraction rules and program code to extract information from certain structured and semistructured documents such as Web pages.
  • Classification-based systems cast information extraction as a classification task (i.e., extraction rules are built based on a classification model).
  • Sequential labeling-based extraction systems cast information extraction as a task of sequential labeling. In sequential labeling, a document is viewed as a sequence of tokens (i.e., words), and a label (such as a part-of-speech tag) is assigned to each token. These systems can model the dependencies between the pieces of target information to be extracted, and these dependencies can be used to improve the accuracy of the extraction (Tang et al., 2008).
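To make the sequential labeling view concrete, the sketch below takes a token sequence and a parallel label sequence and groups the labeled tokens into extracted entities. It assumes the widely used BIO tagging convention (B- begins an entity, I- continues it, O is outside any entity), which the text does not specify, and the labels are written by hand here; in a real system they would come from a trained sequence model.

```python
def decode_bio(tokens, labels):
    """Group tokens into (entity_type, text) spans using BIO labels."""
    entities = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag: close the open entity, skip token.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

# Hypothetical sentence and hand-written labels.
tokens = ["Jane", "Doe", "joined", "Acme", "Corp", "in", "Springfield"]
labels = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-LOC"]

spans = decode_bio(tokens, labels)
print(spans)
# [('PER', 'Jane Doe'), ('ORG', 'Acme Corp'), ('LOC', 'Springfield')]
```

The dependency between neighboring labels is visible even in this decoder: an I-PER label is only meaningful directly after a B-PER or I-PER label, which is the kind of constraint a sequence model can exploit.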

Information extraction from documents has many applications in today’s digital business environment. Examples include:

  • Automated metadata generation for digital libraries,
  • Automated information extraction in day-to-day business applications for data entry automation (e.g., information extraction from resumes, receipts, invoices, legal documents, and others),
  • Automated information extraction from emails, social media, or other text sources for purposes such as IT security, compliance monitoring, or marketing research,
  • Automated document review and analysis for various purposes such as compliance, fraud risk or credit risk detection, audit, financial investigation, scientific research, or patent analysis,
  • Product or movie recommender systems based on patterns extracted from purchase orders or social media contents, and
  • Automated information extraction from SEC filings and other investor communication materials (e.g., press releases, transcripts of earnings calls) for stock market analysis, firm financial performance or stock price movement predictions, or fraud detection.