COM6150: Text Processing

Text Processing

A proposed definition of text processing is the creation, storage and access of text in digital form by computer.

Reasons for studying text processing include the web, which gives access to a large corpus of text (in particular, dealing with information overload and the perceived premium on technology that can facilitate information access), as well as the automatic creation and update of web content.

Other reasons include the shift from databases to metadata (accessing semantic tags, and the automatic creation and update of metadata), and a convergence with NLP (technology that seeks to understand texts). Text processing is usually seen to have a more modest, engineering aim, but the two fields are converging and increasingly borrowing ideas and techniques from each other - particularly in statistical language processing. The distinction between the two areas is commonly seen in terms of whether a task requires some understanding of language or special linguistic knowledge.

Some applications of text processing include information retrieval (IR), which is concerned with the development of algorithms and models for retrieving relevant documents from text collections (ranging from a few hundred electronically stored documents, such as journal papers, to the billions of pages on the web). IR starts with a query, commonly just 2 or 3 words in the case of Internet search engines (but there is debate whether this is enough), and is also concerned with the algorithms that choose the relevant documents, and also how to rank these relevant documents. Currently, much work is still left to the user - selecting which of the returned documents are relevant, and then actually extracting the relevant information from these documents.

IR contrasts with information extraction (IE). IE recognises specific information in documents and makes it available to subsequent automated processes. The type of information to be extracted must be decided in advance.

The information recognised can be divided into classes of important entities (e.g., names, locations, dates, etc), and important relations amongst the entities. The information recognised can be extracted and stored in a structured record (e.g., a database), or tagged with metadata in the document itself.

Another application is text categorisation, such as in e-mail for spam filtering. This is usually achieved by having a set of documents that are representative of each category (using statistical or probabilistic methods to decide which set of documents a new document is most like).

Linguistic Preliminaries

Linguists have assumed that language can be described at a number of levels, which can be studied independently. These levels of linguistic description include:

  • Phonetics - studying how to describe and classify speech sounds;
  • Phonology - studying the principles that govern how speech sounds are used in human language, identifying the minimal units (phonemes) that distinguish words and explaining how these phonemes may be combined in words for each language;
  • Graphetics - studying the physical symbols making up writing systems;
  • Graphology - analogous to phonology, this studies the systems of symbols used in languages, their patterns and variations, and identifies the smallest units whose change affects meaning (graphemes);
  • Morphology - studying the structure of words, and identifying the smallest meaningful elements into which words can be decomposed (morphemes);
  • Syntax - studying the structure of sentences, and how this differs between languages;
  • Discourse analysis - studies the interpretation of spoken and textual discourse (i.e., of multi-sentence communications), and the various processes that connect meaning across sentences;
  • Pragmatics - studies how humans use language in social settings to achieve goals, including how the real intent of an utterance may be implied by indirect statement, and so must be inferred by the hearer.

Syntax

Syntax studies the principles governing how words are combined to form sentences, and how this differs across languages - i.e., the structure of sentences.

The traditional model is that words combine to form phrases, which in turn combine with further words and phrases to form larger phrases, ultimately producing sentences. This implies that sentences have a hierarchical structure.

Linguists group words into classes that show similar behaviour - these are called parts of speech, or word class, or syntactic/lexical category. A possible set of these word classes includes:

  • Nouns;
  • Verbs;
  • Adjectives;
  • Adverbs (modify verbs, much as adjectives modify nouns);
  • Preposition - something that indicates the direction of an action;
  • Determiner - specifies a specific thing referred to by a noun;
  • Auxiliary - a subclass of verbs that do not stand alone, but instead help other verbs express their meaning;
  • Complementiser - embeds a subordinate phrase inside a main phrase;
  • Conjunction - joins together two phrases of equal level.

This grouping is partly based on semantic intuitions, but these groupings are supported by distributional evidence - words of the same class can appear in similar contexts. This is tested by substitution (swapping one word for another) and seeing if the resulting sentence is grammatically correct.

These parts of speech can be divided into two super-groupings: open-class categories (N, V, Adj and Adv), which have many members and change quite frequently; and closed-class (P, Det, Aux, Comp, Conj) or functional categories, which are few in number and change only a little over time, but have a clear grammatical use.

A dictionary or lexicon lists the words of a language, and also specifies the part of speech for each word. Some words can have different parts of speech, particularly if they have different senses (e.g., a crash vs. to crash).

The ways that words and phrases may be combined to form sentences may be described by a grammar. There are various different approaches to formulating grammars. The most common is a phrase structure grammar (which falls under Chomsky's hierarchy as a context-free grammar). This gives rewrite rules to specify how phrases of different types are constructed, and these are called phrase structure rules (e.g., S → NP VP; NP → Det N; VP → V NP).

A grammar assigns a hierarchical structure to sentences, which is often represented as a tree called a phrase structure tree.

Morphology

Morphology is the study of the structure of words. The smallest meaningful elements into which words can be decomposed are called morphemes. Morphology is important for text processing, as we often encounter unfamiliar words, and morphology can be used to infer useful information.

There are three major types of morphological processes: inflectional, derivational and compounding.

Inflections are systematic modifications of a root by the addition of affixes (prefixes, suffixes), which change a single grammatical distinction (e.g., plurality), but not the part of speech or the meaning in a significant way. Inflectional variants are grouped together as variants of a single lexeme.

Derivational morphology works by creating new words by combining morphemes (called derivation). This commonly involves a change to the POS, and often involves a significant change to meaning. The derivation is less systematic than inflection, and there are gaps in what is produced (e.g., quick to quickly, but not fast to fastly).

Compounding is where two or more words are merged to give a new word or lexical unit (noun-noun compounds are very common in English) that is pronounced as a single word, but often written as if separate words. Compounds denote a single semantic concept, deserving a separate entry in the lexicon.

Techniques

Tokenisation

Text is commonly provided as a string, but it is often difficult to process in this format. It is usually more convenient to work with a list of tokens (i.e., sequences of units in the lexicon).

Tokenisation is the task of converting text strings to a list of tokens. The simplest approach is to split the string on whitespace characters, but this can produce anomalous results mixing words and punctuation. Instead, we could split on whitespace and all punctuation characters, but this can also produce anomalous results. We often want to preserve punctuation as separate tokens, so a system might combine several methods (such as knowledge of tokens that include punctuation, components recognising special cases, and assuming that the remaining punctuation characters are separate tokens).
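
As a rough illustration of combining such methods, a regex-based tokeniser (a minimal sketch; the pattern and example sentence are made up for illustration) might keep decimal numbers, abbreviations and words with internal apostrophes or hyphens intact, while letting any remaining punctuation fall out as separate tokens:

```python
import re

# A hypothetical token pattern: abbreviations, decimal numbers and words (including
# clitics and hyphenated forms) are kept whole; anything else that is not whitespace
# becomes a token of its own, so remaining punctuation is preserved as separate tokens.
TOKEN_RE = re.compile(r"""
      (?:[A-Za-z]\.){2,}       # abbreviations such as U.S.A.
    | \d+(?:\.\d+)?            # numbers, including decimals
    | \w+(?:[-']\w+)*          # words, including hyphenated forms and clitics
    | \S                       # any other non-whitespace character
""", re.VERBOSE)

def tokenise(text):
    return TOKEN_RE.findall(text)

print(tokenise("Dr. Smith's dog can't fetch 3.5 sticks, can it?"))
# ['Dr', '.', "Smith's", 'dog', "can't", 'fetch', '3.5', 'sticks', ',', 'can', 'it', '?']
```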

Stemming

For some tasks, we want to treat morphological variants (e.g., invent, inventing, invented, etc) as if they were the same single term (this is called term conflation).

Term conflation can be done using a lemmatiser - this reduces the word to its stem using linguistic knowledge. Lexical look-up is used for irregular forms; otherwise, morphological rules are applied, owing to the computational expense of lexical look-up.

Alternatively, we may use a simple stemmer. This encodes little or no linguistic knowledge and has no lexical look-up, so it is much faster. It applies an ad-hoc morphology without access to the lexicon.

The best-known simple stemmer is the Porter stemmer. It provides a set of rules to strip off and substitute suffixes. The first applicable rule is used, and the rules are applied iteratively.
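
A quick sketch of this behaviour using NLTK's implementation of the Porter stemmer (assuming NLTK is installed; the outputs noted in the comments are what the algorithm typically produces):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["invent", "inventing", "invented", "invention", "university", "universe"]:
    print(word, "->", stemmer.stem(word))
# The first four typically all conflate to "invent"; university and universe both
# reduce to "univers" (an example of over-conflation, discussed below).
```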

Problems of stemming include over-conflation (e.g., police/policy, university/universe), under-conflation (e.g., Europe/European, machine/machinery), and the production of obscure non-word stems that are hard to interpret. Moreover, there is evidence that stemming brings only limited benefits for some tasks (such as machine translation or information retrieval).

Part-of-speech tagging

Part-of-speech tagging has the task of assigning part-of-speech tags to words in text. The tags are atomic labels denoting parts of speech.

There are roughly 10 core parts of speech, but the tag sets used for practical corpus work are usually much bigger, incorporating additional, more fine-grained distinctions concerning morphological form and syntactic function.

POS tagging is non-trivial due to ambiguity. Many words have more than one POS, so the correct tag must be chosen. Also, unknown words can be encountered, for which the correct tag must be guessed. Exploiting constraints from the lexicon and the local context, as well as morphological knowledge, can help.

POS tagging is useful for further syntactic analysis. For full parsing, it greatly reduces the search space, and it provides a basis for shallow (chunk) parsing methods. It is also fairly robust in handling unknown words.

POS tagging can help for other text/NLP applications (e.g., text-to-speech generation for correct pronunciation, or for spelling correction to reduce the search space, or sense disambiguation for information retrieval).

Automatic POS taggers use a lexicon listing possible POS tags for each known word and this may include information on the relative likelihood of each tag. Hand-written systems use a hand-written set of rules to eliminate incorrect candidate tags for ambiguous words, but there is a large overhead for creating such systems.

Data-driven approaches take advantage of machine learning methods such as hidden Markov models (a statistical approach) or transformation-based learning (e.g., Brill's tagger, a rule-based approach).

Probability

Probability has been covered at great length in undergraduate modules such as MCS, AGM, etc, so will not be covered again here. The key points to remember for this module are:

  • Sample spaces;
  • Random variables (discrete and continuous);
  • Probability distributions (in particular the Gaussian distribution/bell-curve, uniform distribution, etc);
  • The axioms of probability;
  • Probability loss (if a non-zero probability is assigned to an impossible event) - this is particularly important in NLP as many text processing models have a loss like this;
  • Joint probability;
  • Conditional probability (including the chain rule, conditional independence and Bayes rule).

Bayesian classification is also essential. We often want to determine the correct hypothesis (h) given some data (D). If we restate Bayes theorem as P(h | D) = P(D | h)P(h) / P(D), then given a set of hypotheses H, we could want to find the most probable hypothesis.

This is known as the maximum a posteriori (MAP) hypothesis: h_MAP = argmax_{h ∈ H} P(h | D), or by Bayes rule, h_MAP = argmax_{h ∈ H} P(D | h)P(h) / P(D). As argmax is only interested in which h maximises the value, not the value itself, we can disregard P(D) (which is constant across H) to get h_MAP = argmax_{h ∈ H} P(D | h)P(h).

A common case is assigning a category to an instance based on its attributes, formulated as a series of features, so the hypothesis h is some category label v ∈ V and the data is a sequence of feature values for the instance. We can thus reformulate the MAP approach to get v_MAP = argmax_{v ∈ V} P(f_1 ... f_n | v)P(v).

This approach has the problem that it may be difficult to get good estimates of the probabilities without an infeasibly large amount of data (i.e., there may be so many combinations of feature values that each is seen rarely or not at all in the available data - 20 features with 5 possible values each gives nearly 100 trillion combinations). This is the data sparseness problem.

One way to solve this is to make the naive (but typically untrue) assumption that the different feature values are conditionally independent given the target value, i.e., P(f_1 ... f_n | v) = ∏_{i=1}^{n} P(f_i | v).

This assumption allows us to modify the MAP approach to give the Naive Bayes Classifier: v_NB = argmax_{v ∈ V} P(v) × ∏_{i=1}^{n} P(f_i | v). Crucially, this allows the probabilities P(f_i | v) to be reasonably estimated from a more limited amount of data.

Although this naive approach and assumption is not, in general, valid, this approach works surprisingly well and shows performance comparable to some other ML approaches on some tasks, such as text categorisation. It is a practical Bayesian learner.
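
A minimal sketch of a Naive Bayes text categoriser over bag-of-words features (illustrative only; it uses simple add-one smoothing and made-up toy documents):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns class counts, per-class word counts, vocab."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def classify_nb(tokens, class_docs, word_counts, vocab):
    """argmax_v P(v) * prod_i P(f_i | v), computed in log space with add-one smoothing."""
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)                             # log P(v)
        total_words = sum(word_counts[label].values())
        for w in tokens:
            count = word_counts[label][w]
            score += math.log((count + 1) / (total_words + len(vocab)))   # log P(f_i | v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [("buy cheap pills now".split(), "spam"),
        ("meeting agenda attached".split(), "ham"),
        ("cheap pills cheap deals".split(), "spam"),
        ("lunch meeting tomorrow".split(), "ham")]
model = train_nb(docs)
print(classify_nb("cheap meeting pills".split(), *model))  # spam
```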

Information Retrieval

The challenge for information retrieval is that given some sort of information need (typically a keyword-based query) and a large, static document collection, to find all, and only those documents relevant to the query.

Typical IR systems search abstracts, newspapers articles, a library, or the web. These systems have multiple issues, such as query formulation, representation/indexing of documents, the retrieval model (to find the best-matching document), efficiency, presentation of results (ranking or clustering) and evaluation of the system.

Indexing

Indexing is the task of finding terms that describe documents well.

Manual indexing can be done by humans (typically using a fixed vocabulary) and is labour and training intensive (and as such typically expensive).

Manual indexing typically uses large vocabularies (several thousand items) and allows for high-precision searches. It works well for closed collections (such as books in a library), but labellers need to be trained for high consistency, and dynamic document collections or constantly changing schemes cause problems.

Descriptors are normally structured into a hierarchy, e.g., MeSH (Medical Subject Headings) has a number of top-level categories, from which you can drill down from any heading to more specific/detailed terms. A given descriptor could appear at several places in the hierarchy, e.g., Stomach Neoplasms comes under both Stomach Diseases and Gastrointestinal Neoplasms.

Automatic indexing uses term manipulation (where certain words count as the same term) and weighting (where certain terms are more important than others), and the terms themselves must be derived from the text.

With automatic indexing, there is no predefined set of index terms, instead natural language is used as the indexing language and we use the words in the document to give information about its content. This is typically implemented using inverted file indices.

A basic inverted file index records, for each term, the IDs of the documents in which it appears. It records only whether or not a term appears in a given document, not the frequency of occurrence in that document.

A more sophisticated version also records the occurrences within each document, which is useful to help find documents more relevant to the query. If we extend this to also record the position of each term occurrence within the document, this may be useful for the searching of phrases in documents.
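
A sketch of such a positional inverted index, mapping each term to the documents it occurs in and the positions within them (the function and data are illustrative):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: dict of doc_id -> token list. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term][doc_id].append(position)
    return index

docs = {1: "the cat sat on the mat".split(),
        2: "the dog sat".split()}
index = build_positional_index(docs)
print(dict(index["sat"]))    # {1: [2], 2: [2]}: documents and positions, useful for phrases
print(sorted(index["the"]))  # [1, 2]: just the documents containing "the"
```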

Retrieval Models

Searching falls into two classes: boolean search (a simple "is this document relevant or not" search, where the presence of a term is both necessary and sufficient for a match, and the boolean operators are set operations), and ranked algorithms (where not all the search terms are necessarily present in the document, and frequency of the document terms is an important characteristic) such as the vector space model or the probabilistic model (and other implementations such as in web search engines).

The bag of words approach is the standard approach to representing documents and queries for information retrieval. It records what words (terms) are present, plus usually a count of terms in each document. It does, however, ignore relations (i.e., order, proximity, etc) between words.

Boolean Model

The Boolean model constructs complex search commands by combining basic search terms (keywords) using Boolean operators (AND, OR, NOT, BUT (AND+NOT) and XOR).

It is often used for bibliographic search engines (such as in a library), but expert knowledge is necessary in order to create high-precision queries, and the notion of relevance is binary - the result list is unranked.

Boolean queries provide a simple logical basis for deciding whether a document should be returned based on whether the basic terms of the query appear in the document and the meaning of the logical operators, but testing each document in turn is infeasible.

Boolean operators have a set-theoretic interpretation, which provides a basis for an efficient retrieval approach. Any basic term corresponds to the set of documents in which that term appears, and the Boolean operators map to set-theoretic operations: AND to intersection, OR to union, NOT to complement and BUT to difference.
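
Given an index that maps each term to the set of documents containing it, Boolean queries reduce directly to these set operations - a minimal sketch with a toy index:

```python
all_docs = {1, 2, 3, 4}
postings = {                          # term -> set of document IDs (a toy index)
    "text": {1, 2, 3},
    "retrieval": {2, 3},
    "compression": {1, 4},
}

def docs_for(term):
    return postings.get(term, set())

print(docs_for("text") & docs_for("retrieval"))    # text AND retrieval -> intersection: {2, 3}
print(docs_for("text") | docs_for("compression"))  # text OR compression -> union: {1, 2, 3, 4}
print(docs_for("text") - docs_for("compression"))  # text BUT compression -> difference: {2, 3}
print(all_docs - docs_for("compression"))          # NOT compression -> complement: {2, 3}
```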

Vector Space Model

In this model, documents are represented as bags of words and are points in high-dimensional vector space (the terms included in the index define the dimensions of the space - i.e., each term forms a dimension, so vectors are generally sparse).

Queries are also represented in vector space based on the terms in the query (that are included in the index), and the documents with the highest document-query similarity are selected. This document-query similarity can be used as the model for relevance (and ranking).

This is defined in lecture 8.

As each document is represented as a vector of n values, the similarity measure used by the vector-space model is the cosine of the angle between the two vectors. This value ranges from 1 (for vectors pointing in the same direction) to -1 (for vectors pointing in opposite directions, although this case does not arise where all the term weights are ≥ 0). A cosine value of 0 means that the vectors are orthogonal.

The Euclidean length of a vector can be computed as a generalisation of Pythagoras' theorem, and the cosine can be interpreted as a normalised correlation coefficient: it measures how well the x_i and y_i values correlate, divided by the lengths of the vectors to scale for magnitude.

Another option is to normalise the vectors in advance (i.e., scale them so that their length = 1); then the cosine is simply the dot product.

Comparison is then done by taking the cosine of the query and the document.
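
A minimal sketch of the cosine measure over sparse term-weight vectors, represented as term-to-weight dictionaries (the toy query and documents are made up):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as term -> weight dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = {"cat": 1.0, "mat": 1.0}
doc_a = {"cat": 2.0, "sat": 1.0, "mat": 1.0}
doc_b = {"dog": 1.0, "sat": 1.0}
print(round(cosine(query, doc_a), 3))  # 0.866: shares "cat" and "mat" with the query
print(round(cosine(query, doc_b), 3))  # 0.0: no terms in common
```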

Information retrieval as discussed so far relies on this idea of terms; however, we must define what exactly a term is. We could just use the words, but this requires an adequate tokenisation approach. Another issue is that of capitalisation - do we conflate Turkey to turkey, for example? Single- and multi-word terms are also an issue (is cheque book a single term, or the separate tokens cheque and book?).

A stemmer can be used to conflate morphological variants.

Stoplists can be used to exclude non-content words (e.g., a, am, etc) which greatly reduces the size of the inverted index, but causes problems if we want to search for phrases that include those terms.

Term Weighting

In binary weighting, only whether or not each term is present is recorded, but this discounts documents with multiple occurrences of the same keyword, which may be more relevant.

We could just use the frequency of the term in the document, but if that term is generally frequent anyway, it is not very useful for discriminating relevant documents. We want to weight terms highly if they are frequent in relevant documents, but infrequent in the collection as a whole.

Some key concepts for this are: D, the document collection (and |D|, the total number of documents in the collection); the term frequency tf_{w,d}, which is the number of times w occurs in document d; the collection frequency cf_w, which is the number of times w occurs in the collection; and the document frequency df_w, which is the number of documents containing w.

The idea is that less common terms are more useful for finding relevant documents (i.e., that these terms are more informative), and this idea is best addressed using document frequency. For a semantically focussed term (e.g., insurance, compared with a more general word such as try), document frequency reflects this difference, whereas collection frequency may be similar.

Informativeness is therefore inversely related to document frequency (i.e., less common terms are more useful for finding relevant documents and more common terms less useful), so we could compute a measure |D|/df_w, whose value reduces as df_w gets larger, tending to 1 as df_w approaches |D|. However, the value is very large for a small df_w, so it over-weights such cases; this can be moderated using a log.

This leads to the widely used measure inverse document frequency (IDF): idf_{w,D} = log(|D| / df_w).

However, not all terms describe a document equally well. Terms which are frequent in a document describe it better (the term frequency), and terms that are rare in the document collection are better discriminators (the IDF). We can combine these two to give the TF·IDF term weighting: tf·idf_{w,d,D} = tf_{w,d} · idf_{w,D}.

These weights can then be used with the cosine measure to compute document/query similarity.
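
A sketch of TF·IDF weighting over a small toy collection; the resulting weight vectors could then be fed into a cosine measure like the one sketched earlier:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: dict of doc_id -> token list. Returns doc_id -> {term: tf·idf weight}."""
    n_docs = len(docs)
    df = Counter()                            # document frequency df_w
    for tokens in docs.values():
        df.update(set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)                  # term frequency tf_{w,d}
        vectors[doc_id] = {w: tf_w * math.log(n_docs / df[w]) for w, tf_w in tf.items()}
    return vectors

docs = {1: "insurance claim insurance".split(),
        2: "try again later".split(),
        3: "insurance policy try".split()}
print(tf_idf_vectors(docs)[1])
# "insurance" is weighted 2 * log(3/2); a term appearing in every document would get weight 0
```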

Evaluation

Effectiveness is commonly evaluated in relation to the relevance of the documents retrieved, but this relevance is typically not binary, but continuous. Even if we idealise relevance to be binary, this can be a difficult judgement to make.

For humans, relevance is subjective, situational and dynamic.

Additionally, other factors may be considered when evaluating systems, e.g., user effort/ease of use, response time, form of presentation, and collection coverage.

Performance of competing systems can be compared using a benchmarking corpus, providing a standard set of documents and queries and a list of documents judged relevant for each query by human subjects, where relevance is treated in a binary way. Benchmark corpora are very expensive to develop, and performance results and system comparisons are only valid for the environment in which the evaluation has been made.

Confusion matrices have been covered in machine learning.

A confusion matrix of relevant/non-relevant and retrieved/not-retrieved can be used to derive recall and precision metrics for the evaluation of the IR systems.

There is a trade off between precision and recall - as more results are examined, precision generally drops, whilst recall increases. Given this, any approach for evaluating and/or comparing system performances must combine both precision and recall. Ideally, this should combine performance across a range of different points on the precision/recall graph, but even then it's not obvious which performance is better.

F-measure

The F-measure can be used to combine precision and recall into a single figure: F_α = 1/(α/P + (1 - α)/R).

The value α determines the weighting of P vs. R: a high α prioritises precision, and a low α prioritises recall. In the common case where α = 0.5, we can express F_0.5 as 2PR/(P + R).

F is a harmonic mean and behaves differently from the arithmetic mean, penalising low performance in one value more than simple averaging does.

Precision at a cut-off point is useful in determining how well a method ranks relevant documents before non-relevant documents. We choose a limit on the number of documents returned (the most relevant ones first) and then compute precision at that point.

Average precision aggregates many precision figures into one evaluation figure. Precision is computed at each point where a relevant document is found, and these figures are averaged: moving down the ranking from the top, each time a relevant document is found we divide the number of relevant documents found so far by the rank of that document.

e.g., for a result where the 2nd, 3rd, 6th, 7th and 8th documents are relevant, the average precision is (1/2 (relevant found so far/rank) + 2/3 + 3/6 + 4/7 + 5/8) / 5 = 0.573.
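
A sketch of (non-interpolated) average precision computed from a ranked list of relevance judgements, reproducing the example above:

```python
def average_precision(relevance):
    """relevance: list of booleans in rank order. Precision is taken at the rank of each
    relevant document and averaged over the relevant documents found."""
    found = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            found += 1
            precisions.append(found / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant documents at ranks 2, 3, 6, 7 and 8:
ranking = [False, True, True, False, False, True, True, True]
print(round(average_precision(ranking), 3))  # 0.573
```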

For interpolated average precision, an interpolated precision score is computed at a number of different recall levels (typically 0%, 10%, 20%, etc.), and these are then averaged.

The interpolated precision is computed for each given recall level by computing the highest precision score observed after that recall level is reached. This is based on the assumption that the typical user is willing to look at more documents if this would increase the percentage of relevant documents amongst those viewed. Typically, the precision of documents viewed will tend to go down as you move down the ranking, but can sometimes go up.

Machine Translation

Machine translation is the translation of text in one natural language into another by means of a computerised system.

Machine translation dates back a considerable amount of time. In 1949, Warren Weaver wrote a memo proposing some key MT ideas, based on WWII code-breaking successes and Shannon's information theory. In the 1950s, MT work started at various US universities, but in 1966, the National Research Council massively cut MT funding, claiming that there were enough human translators to solve the problem and that MT wasn't making enough progress.

Research continued in the private sector, and from 1978-1992, the EU funded the Eurotra project. In the 1990s, statistical/example-based translation systems emerged.

The biggest challenges of MT are lexical ambiguity (different word senses), syntactic ambiguity and pronoun resolution.

Translationally equivalent sentences may have very different distributions of meaning and syntactic structures, e.g., The bottle floated into the cave ⇒ La botella entró a la cueva flotando ⇒ The bottle entered the cave floating. Also, some languages have very different word orders, e.g., English's subject-verb-object compared to Japanese's subject-object-verb.

In particular, it is important to know that the best translation may not be a 1-to-1 mapping (the translated text may not translate back directly into the original).

Classical MT recognises three key approaches: direct, transfer-based and interlingual.

Direct Translation

Direct translation is word-for-word and typically happens after the morphological analysis stage. Transfer-based translation transfers either syntactic or semantic structure (syntactic analysis sits at a lower level than semantic analysis), and the interlingual approach operates at the highest level.

This can be thought of as a V-diagram, with morphological, syntactic and semantic analysis and then semantic composition leading up to the interlingua from the source text on one side, and then semantic decomposition followed by semantic, syntactic and morphological generation leading back down to the target text on the other.

Direct translation is word-by-word, with very little analysis of the source text, and relies on a large bilingual dictionary (for each word, the dictionary specifies a set of rules for its translation). Often, after translation, some simple reordering rules are applied (e.g., moving adjectives after nouns when translating English to French).

This lack of analysis causes several problems: it is difficult or impossible to capture long-range reorderings, and words are translated without any disambiguation of their syntactic role (e.g., noun, verb, etc).

Transfer-based Translation

Transfer-based MT relies on an intermediate representation produced by the analysis of source text, and can be either syntactic or semantic.

The translation process has three phases: analysis - analysing the source-language sentence to produce an intermediate representation; transfer - converting the source-language intermediate representation to the target-language intermediate representation; and generation - generate a target-language sentence from the converted intermediate representation.

The intermediate representations used by different systems vary wildly - from shallow analyses to much deeper ones (e.g., fully semantic representations). The transfer process is performed by a set of rules, showing some similarity to the direct approach, but the conditions and operations refer to the structure of the IRs. This approach makes it much easier to handle long-distance reorderings.

SYSTRAN is arguably the most successful MT system to date, and is transfer-based.

Interlingua-based Translations

Interlingua-based MT relies on an intermediate representation which is language-independent, known as an interlingua.

Here, the translation process has two phases: analysis, which produces the language independent representation of the meaning of the source-language text, and generation, which generates a target-language sentence from the interlingual meaning representation.

Transfer-based and direct MT approaches are O(N²): to achieve translation amongst N languages, N² - N translation systems are needed, as there are (N² - N)/2 language pairs, each requiring translation in both directions. The interlingual approach avoids this problem, instead requiring only N analysis/generation systems. Likewise, to add a new language, interlingual approaches only require the addition of one analysis/generation system, whereas transfer-based systems require N new translation systems.

However, there is a key issue: the difficulty of formulating an interlingua that can represent all possible languages - different languages break down concepts in different ways, and there are many choices in selecting basic concepts, so some way of choosing these needs to be decided.

An interlingua approach requires complex semantic processing, for decomposing and reassembling conceptual units.

Statistical Machine Translation

Statistical machine translation seeks to generate translations using statistical methods.

Bilingual text corpora (e.g., the Canadian Hansard) are used as data in developing their statistical models.

The approach was first seriously developed in the 1990s by IBM in the CANDIDE system, but the idea dates back to 1949 and Warren Weaver. Inspired by WWII code breaking, it was suggested that Russian text be treated as encoded English that simply needs to be decoded; however, this approach was not feasible on early computers, so the idea was not pursued.

A key concept for SMT is the noisy channel model - an idea developed by Shannon to model communication (e.g., communication over an imperfect phone line) where a source goes through a noisy channel and is decoded, guessing at the original message.

The output of the channel is assumed to depend probabilistically on the input.

This model was pioneered by IBM in the early 1990s. The approach is such that it is assumed that someone has the target-language in their head, but by the time it is written down it is "corrupted" to become the source-language. If you can then recover the "original" target-language that was "corrupted" to give the source-language, it will serve as a translation from source-language to target-language.

The basic method is that, given a sentence F (source-language), we search for a sentence E* (target-language) that maximises P(E|F). To recover the original E, we need to reason about what kind of things people say in the target-language, and how the target-language gets turned into the source-language.

Therefore E* = argmax_E P(E|F), or by Bayes rule, E* = argmax_E P(E)P(F|E); so, to translate from F to E, we need to model P(F|E).

Bayes rule is needed because P(E|F) is difficult to model directly - needing very good probability estimates that are very difficult to acquire. The decomposition to P(E)P(F|E) allows us to be sloppy, resulting in good translations with probabilities that are easier to acquire.

P(E) worries about good English (fluency), and P(F|E) worries about faithfulness - it ensures that E has words that generally translate F, but doesn't need to worry about their order (i.e., it's a bag of words). P(E) and P(F|E) can be trained independently.
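
A toy sketch of the noisy-channel decision rule E* = argmax_E P(E)·P(F|E), using made-up probabilities for a handful of hypothetical candidate translations (purely illustrative - real systems search over a vast candidate space):

```python
# Hypothetical candidate translations E for some source sentence F, with made-up
# language-model probabilities P(E) and translation-model probabilities P(F|E).
candidates = {
    "the cat is on the mat": {"p_e": 0.02,   "p_f_given_e": 0.3},
    "the cat mat on is":     {"p_e": 0.0001, "p_f_given_e": 0.35},   # faithful but not fluent
    "hello world":           {"p_e": 0.05,   "p_f_given_e": 0.0001}, # fluent but not faithful
}

def decode(candidates):
    """Return argmax_E P(E) * P(F|E) over the candidate set."""
    return max(candidates, key=lambda e: candidates[e]["p_e"] * candidates[e]["p_f_given_e"])

print(decode(candidates))  # "the cat is on the mat": best trade-off of fluency and faithfulness
```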

To translate, a number of components are required. The first is a language model (modelling P(E)), which judges whether candidate Es are good strings of English and assigns them probabilities; it provides fluent sentences to test as possible translations. LMs are created from a large volume of target-language text, learning which word sequences are common or more probable, typically using n-gram language modelling.

Additionally, a translation model (modelling P(F|E)) is used, which tests for faithfulness - whether a fluent sentence is a likely translation. It assigns a probability to an (F, E) pair, given E. Translation models generally learn by analysing large amounts of parallel text.

The final component needed is a decoder (as in the noisy channel model, in this case - argmax) which provides an effective/efficient search to find E*. As the search space may be infinite, a heuristic search approach is required.

Learning translation models requires a bilingual corpus, which must be sentence-aligned. This is non-trivial, as documents rarely have a 1-to-1 mapping of sentences - human translations may merge/split sentences, or even add/delete sentences. There are some algorithms for automatic alignment, such as Gale and Church's 1993 algorithm, which uses only sentence lengths to evaluate alignments.

A sentence-aligned bilingual corpus must also be word-aligned. Alignment can be learned (e.g., using the Expectation Maximisation algorithm: initially, alternative word alignments are considered equally likely, and then the probabilities of word pairings are increased when they co-occur, which has a knock-on effect for the alignment of other words. Probabilities are then iteratively redistributed until they identify the most probable link, and hence translation, for each word).

This approach is limited to closely related language pairs, and it is difficult to do alignment (and hence learn a translation model) for languages that are structurally quite different. Additionally, languages may have different notions of what counts as a word (e.g., compounding in German produces very large word units) and have different word orders. The word alignment process doesn't consider all possible alignments (this is too costly), and instead limits itself to only those involving local reordering.

Word-based SMT has trouble where several words in the target-language map to one word in the source-language - early IBM models can't handle this case. Another issue is that of phrases, when phrasal units in language should be translated as such (e.g., pomme de terre), as well as a general difficulty in languages that are not closely related (e.g., having very different word orders).

Evaluation

A key problem in the evaluation of MT systems is that text can be translated in multiple ways, each of which is valid. So, how do we decide if a given translation is good, or if one MT system performs better than another?

Some possible approaches include human evaluation - scoring of translations, a comprehension test, or testing round-trip translation; testing in an application that uses MT as a sub-component (cross-lingual information retrieval or question answering involving foreign language documents) or automatic (using metrics such as word error rate (WER) or the BLEU (bilingual evaluation understudy) framework).

Round-trip Translation

Round trip translation (RTT), or back-and-forth translation, uses a MT system to translate some text/sentence from a source language to some target language (forward translation, FT), and then back into the source language (back translation, BT). The initial text and the back translation result are then compared.

This method is often used by lay users who usually make the naive assumption that the input text and the back translation should be the same. It can be done knowing only one language.

RTT is rejected by experts in the field. The method really tests two MT systems: when the result is poor, it is unknown whether the damage was done during the FT or BT step. Additionally, even if the BT result is okay, it doesn't mean the FT was.

The basic premise that input and BT should be the same is flawed - even human translations would not reproduce input in a RTT.

Human Evaluation

In the comprehension test approach, human subjects read the system output and are then asked to answer questions (in the target language) which they should be able to answer from the content of the initial source language text.

This approach tests whether the content of the original has come through into the translation result in an understandable form.

Another human evaluation technique uses subjective scoring of adequacy and fluency. Once subjects have read the reference translation, the system output is scored for adequacy (how much of the reference meaning is expressed in the system output) and fluency (how fluent the output language is).

This method is widely used, but has been shown to exhibit fairly poor inter-annotator agreement.

Another method is sentence ranking. Human subjects are given the sentence translation outputs of several systems and rank them from best to worst (with ties allowed). Scores are then computed over a large set of such ranking results, based on the average number of times one system ranks higher than another.

This approach can only be used for direct comparison between different systems (there is no independent score), but has been shown to exhibit better inter-annotator agreement than the subjective fluency/adequacy methods.

BLEU - Bilingual Evaluation Understudy

Human evaluation of translations is generally viewed as more reliable, but is expensive and slow. An alternative of cheap and fast automated evaluation is needed.

One prominent approach for automated evaluation is BLEU (bilingual evaluation understudy), which assumes that the closer a machine translation is to a professional human translation, the better it is.

The approach compares a candidate translation to multiple human-created reference translations. These are initially costly to produce, but are endlessly reusable.

The number of n-grams (word sequences) in common is counted, and then (typically) n-gram overlap measures for n-grams of length 1-4 are computed, and combined into a single score.

The use of multiple translations allows for variations in style and word choice by human translators. Combining the scores for multiple n-gram overlap measures requires a high-scoring candidate to resemble reference translations in both word choice and word order.

A simple overlap measure for comparing a candidate to reference translations is n-gram precision, i.e., for a given n-gram length: correct n-grams in candidate/total n-grams in candidate, where a correct n-gram is one that appears in any reference translation.

However, a system may unfairly benefit from overuse of some correct term (e.g., "the the the the the the the" compared to "the cat is on the mat" gives a score of 100%). One way to combat this is to clip the counts, that is, don't let any n-gram count more than the maximum number of times it appears in a reference translation.

We define Count_clip = min(Count, Max_Ref_Count), and then the overall score for the sentence is computed as the sum of the clipped n-gram counts for the candidate divided by the total number of n-grams in the candidate.

In practice, scores are computed over a multi-sentence test set. The system produces a set of candidate translations Cand, and the corpus precision for n-grams of length n is computed as:

p_n = (∑_{C ∈ Cand} ∑_{Ngram ∈ C} Count_clip(Ngram)) / (∑_{C ∈ Cand} ∑_{Ngram ∈ C} Count(Ngram)).

The precision scores for the different n-gram lengths 1-4 are then combined as a geometric mean, i.e., (p_1 · p_2 · p_3 · p_4)^(1/4).

However, this n-gram precision alone is not enough, as incomplete translations may score highly. This is addressed by including a sentence brevity penalty BP that punishes short translations.

For each candidate, the reference having the length closest to the candidate is chosen, and the penalty score BP is computed across the corpus for the length divergence between all candidates (Cs) and their closest-length references (Rs). If len(Cs) ≥ len(Rs), then BP = 1; otherwise BP takes a value less than 1 (in the standard formulation, BP = e^(1 - len(Rs)/len(Cs))).

Combining the n-gram precision scores with the sentence brevity penalty gives the overall BLEU score: BLEU = BP × (p_1 · p_2 · p_3 · p_4)^(1/4). This formula assumes that the different n-gram precision scores are equally weighted; the full official formula allows the n-gram precision scores to be differentially weighted.
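
A simplified sentence-level BLEU sketch with clipped n-gram precision, a geometric mean over n = 1..4 and the brevity penalty (real implementations score a whole test set and handle further edge cases; the example sentences are made up):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """Clipped n-gram precision: counts in the candidate are capped at Max_Ref_Count."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: abs(length - c))
    # Brevity penalty: 1 if the candidate is at least as long as the closest reference,
    # otherwise exp(1 - r/c) (the standard formulation).
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    precisions = [clipped_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0                        # geometric mean is 0 if any precision is 0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "the cat sat on the mat".split()
references = ["the cat sat on the red mat".split(), "there is a cat on the mat".split()]
print(round(bleu(candidate, references), 3))  # about 0.765: all 1-3-grams match, 2 of 3
                                              # 4-grams match, with a small brevity penalty
```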

Experiments have been done comparing BLEU scores for systems against those produced by human judges, and the results suggest that BLEU scores correlate well with human evaluations.

Text Compression

Recent years have seen a dramatic increase in storage, transmission bandwidth and processor speed, but also in data volume. Techniques to compress this data remain significant.

Text compression is distinct from some other forms of data compression in that the text must be exactly reconstructable, i.e., the compression must be lossless. For digitised analogue signals (such as images or sound) this is not so critical, and lossy compression methods are often applied.

Text compression techniques can be classified as either symbolwise or dictionary, and static or adaptive.

Dictionary methods work by replacing word and text fragments with an index to an entry in a dictionary, whereas symbolwise methods work by estimating the probabilities of symbols (characters or words) and then coding one symbol at a time using shorter codewords for the more likely symbols. This relies on a modeling step (estimating the probabilities for the symbols in the text, and better probability estimates result in higher compression) and a coding step (converting the text into a bitstream based on the probabilities in the model).

Static methods use a fixed model or dictionary which is derived in advance of the text to be compressed. Semi-static uses a two-pass approach, first to build the model/dictionary and then applying it in the second pass. Adaptive methods build the model/dictionary and then apply it in the same pass.

Models

The function of a model is to predict symbols. The model amounts to a probability distribution for all possible symbols (e.g., the alphabet). The encoder uses the model to encode (i.e., compress) the text, and the decoder then uses this same model to decode it.

The number of bits in which a symbol s should be coded is its information content, l(s). This is related to its predicted probability Pr[s] by l(s) = -log_2 Pr[s].

The entropy, H, of the probability distribution (the average information per symbol over the whole alphabet) is given by:

H = ∑_s Pr[s] · l(s) = -∑_s Pr[s] · log_2 Pr[s] bits/character.

The entropy H places a lower bound on the compression possible, according to Shannon's Source Coding Theorem.
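
A quick sketch computing the entropy of a zero-order character distribution estimated from a string, which gives the lower bound (in bits per character) for coding that string under such a model:

```python
import math
from collections import Counter

def entropy(text):
    """H = -sum_s Pr[s] * log2 Pr[s] for the zero-order character distribution of text."""
    counts = Counter(text)
    total = len(text)
    h = 0.0
    for count in counts.values():
        p = count / total
        h += p * -math.log2(p)    # information content l(s) = -log2 Pr[s]
    return h

print(round(entropy("abracadabra"), 3))  # about 2.04 bits/character: the lower bound for
                                         # coding this string with a zero-order model
```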

The probability of encountering a given symbol at a particular place in a text is often influenced by preceding symbols. Models that take immediately preceding symbols into account are called finite-context models, and the best text compression results consider contexts of 3-5 characters.

Probabilities may be estimated in various ways. Static modelling performs poorly on texts that are atypical of those used to construct the model, while semi-static models are inefficient because the model must be transmitted alongside the text and two passes over the text are required.

Adaptive Models

Adaptive models begin with a base probability distribution, and this model is then refined as more symbols are encountered during the encoding. Therefore, the text being encoded is itself used to refine the model.

The decoder starts with the same base probability distribution and, as it is decoding the same symbol sequence, can refine the model in the same way.

Care must be taken to ensure no character is ever predicted with zero probability simply because it has not been seen yet. A principal disadvantage is that adaptive models are not suitable for random access to a file, as the text can only be decoded from the beginning; this poses difficulties for use in retrieval applications.

Coding

Coding is the task of determining the output representation of a symbol, given the probability distribution supplied by the model. A coder should output short codes for high-probability symbols and long codes for low-probability symbols. Coder speed may also be significant, as computing optimal codes can be slow, hence a trade-off between speed of compression and compression rate is often made.

Huffman Coding

Huffman coding dates from 1952 and was dominant until the 1970s. Refined versions of it are still in use today.

A code tree is built for encoding and decoding. Each branch is labelled with a 0 or 1 and then each leaf node is a symbol in the alphabet. This results in a prefix-free code, where no codeword is the prefix of another.

The code tree is constructed bottom up from the probabilistic model according to the following algorithm:

1. Associate probabilities and symbols with leaf nodes;
2. Two nodes with the smallest probabilities are identified and joined under a new parent node whose probability is their sum;
3. Repeat step 2 until only one node remains;
4. Assign a 0 or 1 to each tree edge.

The path from the root node to the leaf containing the character is then used to derive the code (the 0s and 1s from the edges along the way).
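
A minimal sketch of Huffman code construction following the algorithm above, using a heap to repeatedly join the two lowest-probability nodes (the probabilities are made up; any consistent assignment of 0/1 to edges gives an equally valid code):

```python
import heapq

def huffman_codes(probabilities):
    """probabilities: dict of symbol -> probability. Returns dict of symbol -> codeword."""
    # Each heap entry is (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Join the two smallest-probability nodes under a new parent whose
        # probability is their sum, labelling the two edges 0 and 1.
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        for sym in codes1:
            codes1[sym] = "0" + codes1[sym]
        for sym in codes2:
            codes2[sym] = "1" + codes2[sym]
        heapq.heappush(heap, (p1 + p2, counter, {**codes1, **codes2}))
        counter += 1
    return heap[0][2]

print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.1}))
# {'a': '0', 'b': '10', 'd': '110', 'c': '111'}: more probable symbols get shorter codes
```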

Huffman coding is fast for both encoding and decoding, provided that the probability distribution is static, and variants have been developed for adaptive models, but these are complex and arithmetic coding is a better bet.

It is most effective when a word-based rather than character-based model is used. Huffman coding is fast, gives good compression and supports random access to compressed files.

To support the large models that arise with word-based encoding, a different but highly efficient representation of Huffman codes, called canonical Huffman codes, is used. The lengths of the canonical codes are determined as before, but within blocks of codes that are the same length, the symbols are assigned in alphabetical order, so only the codeword for the first word in each block of a given length plus the ordered list of words of each length is stored. Consequently, a full decode tree need not be stored and used for decoding.

To decode a canonical Huffman code, the first codeword of the appropriate length n is found, the binary difference d between that first code and the actual code is computed, and d is then used as the offset into the ordered list of words with n-bit codes.

Arithmetic Coding

Arithmetic coding allows for excellent compression. It can code arbitrarily close to entropy, hence is optimal in terms of compression.

In the case of a very skewed distribution (e.g., Pr[a] = 0.99 and Pr[b] = 0.01), it can win over Huffman coding, as a can be coded in -log_2 Pr[a] ≈ 0.015 bits, whereas Huffman coding requires at least 1 bit per symbol.

Arithmetic coding is most suited for sophisticated adaptive models. However, it is slower than Huffman coding, and it is not easy to start decoding in the middle of a compressed stream and hence less suitable for full-text retrieval where random access to compressed text may be needed.

Arithmetic coding is often most useful for images, whereas Huffman is for text.

The output of an arithmetic coder is a stream of bits, but it is easier to think of it as a fractional binary number between 0 and 1. The part after the decimal point is used as the code.

The arithmetic coder works by storing two numbers - high and low - representing a subinterval of [0,1] used to code the next symbol. Initially high = 1 and low = 0.

The range between high and low is then sub-divided according to the probability distribution of the model and sub-intervals allocated for coding each symbol. The coding step involves resetting the high/low values to narrow the recorded interval.

This process repeats to continue narrowing the interval until the final interval represents the full input. The encoder can choose any code in between high and low (the shortest code of course is the best choice). The transmitted code plus the probability model is then sufficient for decoding. The decoder simulates the steps in the encoding process, and as it does, recovers the encoded content.

The smaller the final interval is, the more bits that will be needed to specify a number that falls within it.

Intuitively, the method works because high-probability events narrow the interval much less than low-probability events do, and the number of bits required is proportional to the negative logarithm of the size of the interval. The interval size is the product of the probabilities of the coded symbols, and the logarithm of this is the sum of the logs of the individual probabilities. Hence, a symbol of probability Pr[s] contributes -log_2 Pr[s] bits to the output. This is equivalent to the symbol's information content, hence the method is near-optimal: the code size matches the theoretical bound given by the entropy, and high-probability symbols can be coded in a fraction of a bit.
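
A toy sketch of the interval-narrowing step using floating-point arithmetic (real coders use limited-precision integer arithmetic and incremental output, as discussed below):

```python
def arithmetic_interval(text, model):
    """model: dict of symbol -> probability (a static zero-order model).
    Returns the final [low, high) interval; any number inside it encodes the text."""
    # Fix an ordering of symbols and compute cumulative probabilities.
    cumulative, running = {}, 0.0
    for symbol, p in model.items():
        cumulative[symbol] = running
        running += p

    low, high = 0.0, 1.0
    for symbol in text:
        width = high - low
        # Narrow [low, high) to the sub-interval allocated to this symbol.
        high = low + width * (cumulative[symbol] + model[symbol])
        low = low + width * cumulative[symbol]
    return low, high

low, high = arithmetic_interval("aab", {"a": 0.8, "b": 0.2})
print(low, high)  # roughly [0.512, 0.64): width 0.8 * 0.8 * 0.2 = 0.128, the product of
                  # the symbol probabilities, so about -log2(0.128) ≈ 3 bits are needed
```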

The method is limited in practice by the need to eventually transmit a whole number of bits/bytes and limited precision arithmetic. Also, the method produces no output until the encoding is complete (i.e., all input processed). In practice it is possible to output bits during coding, which avoids having to work with higher and higher precision numbers.

We can observe that when the range is small, high/low have a common prefix and we can drop this prefix and reset high/low, allowing output to be generated incrementally.

Arithmetic coding is more commonly used with adaptive modelling and the probabilities used are based on counts observed in the text and updated dynamically as it continues. The decoder can simulate this encoding process including all the counting/model updates.

These symbolwise methods work by estimating the probabilities of symbols (characters/words) and coding one symbol at a time using shorter codewords for the more likely symbols.

The symbolwise models considered above are zero-order character models (i.e., they do not consider left context and are therefore not a serious contender for effective text compression). Two alternative symbolwise models are prediction by partial matching (PPM), which arguably obtains the best compression performance, and zero-order word-based models, which are very effective in information retrieval applications; however, there are many others. These models produce probability distributions that can be fed into either Huffman or arithmetic coders.

Prediction-by-Partial-Matching

Prediction-by-Partial-Matching (PPM) is a character-based symbolwise model that determines probabilities in a given (finite) context, but where the length of the context may vary depending on contexts previously seen.

PPM starts with the maximum context and considers whether the characters preceding the next character to be encoded have occurred before in the prior text. If they have not, PPM shifts to a smaller context and checks whether this shorter snippet has occurred before (the decoder can do this too, as it has the same knowledge of prior contexts). If the context has occurred, but was never followed by the character to be encoded, then PPM sends a special escape character to tell the decoder that the character cannot be encoded in this context, and it should drop down to the next smaller context.

In the case where the preceding context has occurred m times before, and in k of these cases was followed by our next character s, then P(s) = k/(m + e) in this context, where e is the number of escape characters used. This probability is then used for coding.

In the worst case, 6 coding steps are required, but this is rare, especially if the model has been trained on earlier text. There are many proposed variants and extensions of PPM, and PPM models arguably provide the best compression performance.

Word-based Models

So far, it has been assumed that symbols are characters, however in word-based models, symbols are words (alphanumeric strings) and non-words (white-space and punctuation). Documents are assumed to consist of strictly alternating words and non-words and, typically, a zero-order model is built for words, and a separate one for non-words.

If adaptive models are used, a mechanism is needed for previously unseen words and nonwords, e.g., sending an escape symbol and then spelling out the word character by character, using a zero-order model of characters.

The "parsing" of text into words and non-words raises a number of issues, e.g., handling punctuation that is part of a word (hyphens, apostrophes, etc), handling numeric strings (could lead to a large number of words that only appear once), as well as ideographic languages, where segmenting text into words presents a larger challenge.

Word-based models can yield a large number of symbols, so an efficient data structure for the model is important - canonical Huffman code is good for static/semi-static versions.

Word-based models can achieve compression performance close to PPM, and with semi-static models, random access is supported.

Dictionary Models

Dictionary methods rely on replacing substrings in a text with codewords, or indices, that identify the substring in a dictionary or codebook.

One simple example is digram coding which expands ASCII to an 8-bit system with the higher 128 code points representing common letter pairs. At worst, each 7-bit character is expanded to 8-bits, and at best, character pairs (14-bits) are reduced to 8-bits. Such approaches can be extended to include larger dictionary entries (such as common words and prefixes/suffixes).

However, good compression is difficult with a general dictionary, as words common to many texts tend to be short, and a dictionary suited to one sort of text may be poor for another. Semi-static schemes can be used, but transmitting such a dictionary is often inefficient, and it is hard to decide which words are selected for a dictionary.

Adaptive dictionary schemes are often best, and most schemes are based on two methods proposed by Ziv and Lempel in 1977/78 (LZ77/LZ78). The key idea in these is to replace a substring with a pointer to a previous occurrence of it in the same text.

LZ77

The LZ77 encoder outputs a series of triples, where the first component indicates how far back in the decoded output to look for the next phrase, the second indicates the length of that phrase, and the third is the next character from the input (which is strictly only necessary when that character has not been seen before).

To decode, the triple is simply parsed, e.g., for the triple ⟨5, 3, b⟩, we go back 5 characters, copy the next 3 characters and then add the third item from the triple to this.

We can also have triples which make a recursive reference, e.g., ⟨1, 5, a⟩, which means go back one character and then sequentially copy the next 5 characters. When the second character for copying is needed, it is available as a copy of the first, and this results in 5 consecutive copies of the previous character followed by an a.
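
A sketch of LZ77 decoding from a list of ⟨distance back, length, next character⟩ triples, copying one character at a time so that the recursive-reference case works naturally:

```python
def lz77_decode(triples):
    """triples: list of (distance_back, length, next_char). Returns the decoded string."""
    output = []
    for distance, length, char in triples:
        start = len(output) - distance
        for i in range(length):
            # Copy one character at a time, so the source region may overlap the
            # text currently being produced (the recursive-reference case).
            output.append(output[start + i])
        output.append(char)
    return "".join(output)

print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 1, "a"), (3, 2, "b")]))  # "abaabab"
print(lz77_decode([(0, 0, "b"), (1, 5, "a")]))  # "bbbbbba": five copies of the previous
                                                # character, then an 'a'
```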

Various issues must be addressed in implementing an adaptive dictionary method, such as how large pointers can be (references further back increase the chance of longer matching strings, but also increase the bits required to store the pointers; typical window sizes are around a few thousand characters) and how long the strings referred to can be (again, the longer the string is allowed to be, the more bits are needed to represent that parameter - typically around 16 characters are allowed).

During encoding, the method of searching the window of prior text for the longest match with the upcoming phrase presents an issue - linear search is very inefficient, and it is best to index the prior text with a suitable data structure (e.g., trie, hash or binary search tree).

gzip

gzip is a popular high-performance implementation of LZ77. A hash table is used to locate the previous occurrences of strings. The hash is accessed by the next 3 characters and holds pointers to prior locations of those 3 characters.
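
The indexing idea can be sketched as follows (a plain illustration of the technique, not gzip's actual data structures); the window and match-length limits are arbitrary values:

    # Sketch of finding the longest previous match via an index keyed on 3-character strings.
    def longest_match(text, pos, window=4096, max_len=16):
        index = {}                            # 3-character string -> earlier positions
        for i in range(max(0, pos - window), pos - 2):
            index.setdefault(text[i:i+3], []).append(i)

        best_dist, best_len = 0, 0
        for start in index.get(text[pos:pos+3], []):
            length = 0
            while (pos + length < len(text) and length < max_len
                   and text[start + length] == text[pos + length]):
                length += 1
            if length > best_len:
                best_dist, best_len = pos - start, length
        return best_dist, best_len

    print(longest_match("abracadabra abracadabra", 12))   # -> (12, 11)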

Pointers and phrase lengths are then stored using variable-length Huffman codes, which are computed semi-statically by processing the data in blocks of 64 kB at a time. A block of this size can easily be held in memory, so the scheme still has the appearance of being single-pass.

Pointer triples are reduced to pairs by eliminating the third element: a single Huffman code is used to transmit both phrase lengths and literal characters (one of these is transmitted first), and if a phrase length was transmitted, the next unit will be a pointer.

Good compression techniques work best on large files, and decompression is inherently sequential. This tends to preclude the random access needed by full-text retrieval systems, so special measures are required to facilitate random access for these applications.

Good compression methods make random access difficult because they use variable-length codes, so decoding can not start at a random point as there is no guaranteed way to find a codeword boundary, and because adaptive models are used, whose state can not be determined without decoding all of the prior text.

It is best to use static models for full text retrieval, as there is no good solution for adaptive modelling.

Synchronisation

Techniques have been developed for achieving random access in compressed files.

One technique is synchronisation points. If you assume that the smallest unit of random access in a compressed archive is the document, then either the bit offset of a document is stored, or it is ensured that it ends on a byte boundary.
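
A sketch of the byte-boundary variant might look as follows (the archive layout is purely illustrative):

    # Sketch of synchronisation points: each compressed document starts on a byte
    # boundary and its byte offset is recorded, so documents can be decoded independently.
    def build_archive(compressed_documents):
        archive, offsets = bytearray(), []
        for doc in compressed_documents:       # each doc is an already-compressed bytes object
            offsets.append(len(archive))       # byte offset at which this document starts
            archive.extend(doc)
        return bytes(archive), offsets

    def get_document(archive, offsets, i):
        end = offsets[i + 1] if i + 1 < len(offsets) else len(archive)
        return archive[offsets[i]:end]         # pass this slice to the decompressor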

Another technique is to use self-synchronising codes, where the code is designed so that regardless of where decoding starts, it comes into synchronisation rapidly and stays there. This is still problematic for full-text retrieval as it is not possible to guarantee how quickly synchronisation will be achieved.

Performance considerations for compression algorithms include the speed, memory use and compression rate of the algorithm.

Several methods can be used to compare performance, e.g., compressing a number of documents from a corpus (of different types, e.g., novels, bitmaps, C code, binaries, executable code, etc.) using the different algorithms, with a simple 1-1 input-output mapping (e.g., cat) as a baseline, so that the results along the different dimensions can be compared.

Automatic Summarisation

Summarisation, in the most general sense, is the task or art of abstracting key content from one or more information sources: "take an important source ... and present the most important content to the user in a condensed form and in a manner sensitive to the user's or application's needs" (Mani, 2002).

In this general view, the original content may be from various media (e.g., text, speech, video, etc) and the summary may be in the same or different medium (e.g., for video content such as a football match a textual summary might be provided).

In the most familiar cases, summarisation is of texts or documents, and this is the area which automatic summarisation focusses on.

Examples of summarisation include headlines and leaders on news websites (typically manually generated) or abstracts in academic reports. Familiar genres of condensation include headlines, outlines, minutes, biographies, abridgements, sound bites, movie summaries and chronologies.

Good text summaries should be considerably shorter than the input text, cover all the main points of the input text, be truth-preserving, and be good (coherent) texts in their own right. Additional desirable characteristics may include flexibility with regard to length, the target audience and the task.

Text summaries are classified as being of different types (typology) along a number of dimensions:

  • indicative (summaries which help the user decide whether to read the full document or not) vs. informative (summaries which cover all the key information of the source, and serve as a surrogate for the full document) vs. critical (where the summary evaluates the content of the full document);
  • extract (consists of material copied from the input document, e.g., key sentences/phrases) vs. abstract (rephrasings/rewrites of the input which may contain material derived from, but not part of, the input document);
  • generic (summaries which capture the central ideas of a document, reflecting the author's own prioritisation of content) vs. query-based (summaries which emphasise content relevant to a specific information need from a user query) vs. user-focussed (content selection is based on 'long-standing' interests/needs of an individual or group profile);
  • single-document (summaries from a single text source/document) vs. multi-document (summaries including content from multiple texts - such as the result of information retrieval from a specific query so all documents are on related subject matter). Multi-document summarisation in particular presents additional problems. Very high rates of compression are required, and there is a high degree of redundancy of content in the sources;
  • form of summary (e.g., text paragraph vs. bullet points);
  • user expertise (e.g., novice vs. expert - which content needs to be simplified, or background material included);
  • mono vs. cross-lingual summarisation (mostly the source and the summary are in the same language, but summarisation may be combined with translation to produce indicative summaries in a language different from the source text).

Automatic text summarisation algorithms take an input document, and typically a compression rate and a content preference indicator (e.g., a user profile or query). The process then typically consists of an analysis phase (forming an internal representation of the document, e.g., a list of tokens or meaning propositions), a transformation phase (operations on this representation, e.g., scoring sentences for relevance or making inferences over meaning) and then synthesis (producing an output - either printing the selected extracts or generating new text).

Automatic summarisation approaches can be classed as either shallow or deep.

Shallow approaches typically use a word-level analysis and produce extracts. These summaries may lack coherence and cohesion and are subsequently not 'nice to read', and are typically knowledge-poor (or knowledge-light).

Deep approaches typically perform some level of semantic analysis and produce abstracts. These aim to be knowledge-rich, but tend to be domain dependent, and are difficult and costly to develop and subsequently adapt to new domains.

Shallow Approaches

Shallow approaches produce extraction-based summaries, with the extracts considered for inclusion typically sentences of the input document. Summarisation is then reduced to the task of sentence selection.

A number of different techniques and criteria have been used for selecting (scoring) the sentences to include in a summary, e.g., word frequency for indicating the "topic" of a document, cue phrases (conventional expressions signalling important content), positional methods (selecting important sections/regions of the document) or the title method (using the title or headings as key indicators of topic).

Word Frequencies

This was the first work done on automatic summarisation, by Luhn in 1958. The claim is that frequently occurring content words (with reference to some background corpus) form keywords which indicate document topics, and clusters of keywords indicate summarising sentences.

Information retrieval style approaches can be used, which may employ various information retrieval ideas. Stoplists can help discount unhelpful terms, and stemming can be useful to identify keywords. IDF (with frequency in some background corpus) may be used, and keywords identified with some threshold criterion.

Keyword clusters, within a window, are then found and scored, and high scoring sentences selected. The Luhn method can therefore be considered a sentence selection approach.
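
A minimal Luhn-style scorer might be sketched as below; the stoplist, keyword threshold and window size are arbitrary illustrative choices, not Luhn's original parameters:

    # Sketch of Luhn-style sentence selection: frequent content words become keywords,
    # and a sentence scores highly if it contains a dense window ("cluster") of keywords.
    from collections import Counter
    import re

    STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "it", "that"}   # illustrative stoplist

    def luhn_summary(text, n_sentences=2, keyword_fraction=0.1, window=5):
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        sentence_words = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

        counts = Counter(w for ws in sentence_words for w in ws if w not in STOPWORDS)
        n_keywords = max(1, int(len(counts) * keyword_fraction))
        keywords = {w for w, _ in counts.most_common(n_keywords)}

        def score(words):
            best = 0.0
            for i in range(len(words)):                  # best keyword density over all windows
                chunk = words[i:i + window]
                k = sum(1 for w in chunk if w in keywords)
                if chunk:
                    best = max(best, k * k / len(chunk))
            return best

        ranked = sorted(range(len(sentences)), key=lambda i: score(sentence_words[i]), reverse=True)
        return " ".join(sentences[i] for i in sorted(ranked[:n_sentences]))   # keep document order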

Position Method

This method was developed by Edmundson in 1969 and uses the claim that important sentences occur in specific positions.

In certain genres, important sentences occur near the start of the text. This is true, for example, in news text, and lead-based summary methods (e.g., Brandow 1995) work surprisingly well for this genre.

In other genres, the introduction and conclusion paragraphs are particularly important.

This method works by first assigning scores to sentences according to their position within the paragraph, and then by the position of their paragraph within the entire document. High scoring sentences are then selected.
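
A crude positional scorer along these lines might be sketched as follows (the particular weights are invented for illustration):

    # Sketch of position-based scoring: earlier paragraphs score higher, and within a
    # paragraph the initial (and, less so, the final) sentence scores higher.
    import re

    def position_scores(document):
        scored = []                                    # (score, sentence) pairs
        paragraphs = [p for p in document.split("\n\n") if p.strip()]
        for p_idx, paragraph in enumerate(paragraphs):
            sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())
            paragraph_weight = 1.0 / (1 + p_idx)       # earlier paragraphs weighted more heavily
            for s_idx, sentence in enumerate(sentences):
                if s_idx == 0:
                    sentence_weight = 1.0              # paragraph-initial
                elif s_idx == len(sentences) - 1:
                    sentence_weight = 0.5              # paragraph-final
                else:
                    sentence_weight = 0.25             # medial
                scored.append((paragraph_weight * sentence_weight, sentence))
        return sorted(scored, reverse=True)            # highest-scoring sentences first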

The Optimum Position Policy approach of Lin & Hovy (1997) agrees with the view that position is a valuable cue to locating important sentences, but argues that the particular positions are genre dependent (e.g., the lead position is important for news, but perhaps not for reviews).

They advocate using a machine learning approach to learn the important positions for different genres.

Title Method

This approach, proposed by Edmundson in 1969, assumes that words in titles/headings are valuable indicators of the central topics of a document.

Titles and headings are then processed to identify a set of topic keywords. These key words can then be used in scoring for sentence selection (e.g., as in Luhn's 1958 frequent words method).

Cue Phrases

This method, also proposed by Edmundson in 1969, uses the idea that certain phrases indicate which sentences contain important content (e.g., "In conclusion", "In this paper, we show" and "Significantly").

The method may distinguish between bonus words (which increase a sentence's score, e.g., "important" or "significant") and stigma words (which decrease a sentence's score, e.g., "hardly" or "impossible"). These lists can be compiled automatically.

Feature Combination

Also proposed by Edmundson in 1969, this uses multiple cues to find important sentences.

The approach proposed by Edmundson uses a linear combination of contributions with weights. The four features used were: title, cue, keyword and position. The weights are then adjusted using training data with a minimisation technique (there are various methods to do this).

Edmundson found that the best combination was cue + title + position.
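
The combination itself is just a weighted sum, which might be sketched as follows; the weights and feature scorers are placeholders rather than Edmundson's actual values:

    # Sketch of Edmundson-style linear feature combination (placeholder weights).
    WEIGHTS = {"cue": 1.0, "title": 1.0, "keyword": 0.0, "position": 1.0}   # e.g. cue + title + position

    def combined_score(sentence, feature_scorers, weights=WEIGHTS):
        """feature_scorers maps feature name -> function(sentence) returning a score in [0, 1]."""
        return sum(weights[name] * scorer(sentence) for name, scorer in feature_scorers.items())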

Kupiec, Pederson & Chen (1995) took a closer look at the specific combined approach and developed a system which uses multiple features and treats sentence selection as a modified case of naive Bayesian classification.

A classifier is trained using a corpus of document/summary pairs (which in this case was a collection of scientific journal articles with abstracts created by professional abstractors - average 3 sentences in length and mainly indicative).

A matching algorithm was used to select sentences in the full document that most closely correspond to abstract sentences, so the final data consisted of the articles and their selected sentences.

Five discrete-valued features were selected:

  • Sentence Length Cut-off Feature - short sentences tend not to be included in summaries, so this feature is true if the sentence length is greater than some threshold;
  • Fixed-Phrase Feature - this is true if the sentence contains one of a list of key phrases, or if it follows a section heading containing a key term;
  • Paragraph Feature - sentences are classified by their position in a paragraph: initial, medial or final;
  • Thematic Word Feature - the most frequent content words are designated theme words, and sentences are ranked based on the occurrence of theme words. The feature is then true if the sentence is among the highest ranked ones;
  • Uppercase Word Feature - proper names and acronyms are considered to be important, and the sentence is ranked according to the occurrence of uppercase words/acronyms.

For each sentence s, the probability that it is in the summary S given its feature values (F1, ..., Fk) is computed (i.e., P(s ∈ S | F1, ..., Fk)). Probability is then used as a score to rank and select sentences. Top-ranked sentences are then selected to meet some compression requirement.

Bayes' rule can then be used, and a naive Bayesian approach makes the assumption that the features describing a situation (usually the characteristics of an item and its context) are conditionally independent. This may not be true, but the approximation often produces an approach that works quite well.

Given this assumption, we can apply Bayes' rule; because the resulting quantities involve individual features, they are much easier to estimate than quantities over the conjunction of features (due to data sparseness). Hence, we can rewrite:

P(s ∈ S | F1, ..., Fk) = ( P(s ∈ S) × ∏j=1..k P(Fj | s ∈ S) ) / ( ∏j=1..k P(Fj) ).
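
The scoring step might be sketched as below; the feature extraction and the probability estimates (taken from the training corpus, and assumed smoothed so that no probability is zero) are taken as given:

    # Sketch of naive Bayes sentence scoring for extraction.
    import math

    def summary_score(feature_values, p_summary, p_feature_given_summary, p_feature):
        """
        feature_values:           {feature name: observed value} for one sentence
        p_summary:                P(s in S), the prior probability of selection
        p_feature_given_summary:  {name: {value: P(Fj = value | s in S)}}
        p_feature:                {name: {value: P(Fj = value)}}
        Returns log P(s in S | F1, ..., Fk) under the independence assumption.
        """
        log_p = math.log(p_summary)
        for name, value in feature_values.items():
            log_p += math.log(p_feature_given_summary[name][value])
            log_p -= math.log(p_feature[name][value])
        return log_p

    # Sentences are ranked by this score and the top ones kept until the compression target is met.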

For 25% summaries (i.e., a 25% compression rate), 83% precision can be achieved.

The extraction-based method of constructing summaries has serious problems, particularly in terms of cohesion (whether the text hangs together, often signalled by linguistic devices such as co-reference, lexical repetition, ellipsis, etc. - text can be cohesive but still nonsensical) and coherence (whether the text makes sense, or is meaningfully connected, often involving rhetorical relations - text can be coherent without cohesion clues). Extraction can damage cohesion links.

Extracted summaries can therefore be disjointed in meaning, or even carry false implications, if co-references are disrupted by the extraction.

Evaluation

Evaluation is required for system development (e.g., for adjusting system parameters to optimise performance, or for identifying the contribution of a component to overall performance), for comparison of alternative systems and approaches, and for identifying when a system performs above some minimal threshold for commercial usage.

Evaluation approaches can be either intrinsic, which addresses the quality of the summary itself, or extrinsic, which addresses the impact of using the summary (vs. the full document) on subjects' performance at some task.

Evaluation can be done by human judges (which is subjective and may result in disagreement between judges, as well as being time-consuming and expensive), or automatically, which is more objective and repeatable, but only possible for limited kinds of evaluation. Additionally, gold standard reference summaries need to be created, which are costly to produce.

Extrinsic evaluation is task-based. It can consider both how accurately the task using the summary was done, and also how quickly. These tasks may include answering questions on the topic of the full document, deciding if the document is relevant to a given topic or not, or assigning a category to the document out of a list of categories.

Intrinsic evaluation addresses the quality of the summary itself and may ask whether or not it is coherent and cohesive, whether it covers the main topics of the document, and what important things are missing.

Automatic intrinsic evaluation only addresses whether the conceptual content of the reference summary is represented in the automatic summary, and can not evaluate for subtler requirements (e.g., coherence and cohesion).

Where reference and system summaries are both produced by extraction (e.g., of sentences), they can be evaluated on the basis of co-selection (i.e., does the system select all/only those sentences included in the reference summary). The metrics of precision (proportion of system summary sentences that are also in the reference summary) and recall (proportion of reference summary sentences that are included in the system summary) can then be used to score the co-selection.
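
For co-selection, these metrics reduce to simple set operations over the selected sentence identifiers, as in this sketch:

    # Precision and recall of co-selection, over sets of selected sentence indices.
    def coselection_scores(system_sentences, reference_sentences):
        system, reference = set(system_sentences), set(reference_sentences)
        overlap = len(system & reference)
        precision = overlap / len(system) if system else 0.0
        recall = overlap / len(reference) if reference else 0.0
        return precision, recall

    print(coselection_scores({1, 4, 7, 9}, {1, 4, 8}))   # -> (0.5, 0.666...)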

In the case where they are abstracts, co-selection can not be used. Instead, we can attempt to assess the extent to which the content of the reference summary is represented in the system summary by using some measure of textual overlap.

One such approach is the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) system. This implements a range of different metrics for textual overlap, and computes multiple overlap scores for comparison of the system summary to one or more reference summaries.

Some ROUGE metrics include: ROUGE-N (n-gram co-occurrence), ROUGE-L (longest common subsequence, possibly skipping words) and ROUGE-W (weighted longest common subsequence, favouring consecutive matches).
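
A simplified recall-oriented n-gram overlap in the spirit of ROUGE-N (not the official ROUGE tool, which handles stemming, multiple references and other details) can be sketched as:

    # Simplified ROUGE-N-style score: recall of reference n-grams found in the system
    # summary, with clipped counts. Not the official ROUGE implementation.
    from collections import Counter

    def rouge_n(system_summary, reference_summary, n=2):
        def ngrams(text):
            tokens = text.lower().split()
            return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

        system, reference = ngrams(system_summary), ngrams(reference_summary)
        overlap = sum(min(count, system[gram]) for gram, count in reference.items())
        total = sum(reference.values())
        return overlap / total if total else 0.0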