
An Oscan text from Pompeii, Italy, 3rd or 2nd century BCE, currently in the British Museum. Photograph by Reuben Pitts.


Guidelines

Comparative concepts

A typological database must accurately describe not one, but many different languages. In order to do so, it makes use of comparative concepts, 'that is, concepts specifically designed for the purpose of comparison that are independent of [language-specific] descriptive categories' (Haspelmath 2010). In other words, the basis of comparison in the LitTEL features should always appeal to general conceptual and formal notions, rather than assuming that terms like 'adjective' or 'object' have cross-linguistically consistent instantiations.

 

Tailoring features to Little Corpora

LitTEL aims to ask typological questions of little corpora. Doing so without bias requires a fine-tuned methodology.

Suppose one were to annotate Venetic, for instance, for the WALS features. Many features (like the existence of an inclusive-exclusive distinction in the first person pronoun) would plainly be impossible to annotate due to a lack of data. This is, in itself, not a major problem: we can still in principle compare the 'little' Venetic dataset with a 'big' language like Latin by eliminating the features for which Venetic attests no data.

For other features, however, such as the number of nominal cases, the annotator could certainly enter a value. Five cases are in evidence, but who is to say that Venetic did not possess further cases that are simply not attested? This problem is more pernicious, as it means the annotation of Venetic will never be properly comparable to the annotation of, say, Latin, where we have enough data to be sure that no cases are missing. At what point can a researcher conclude they have enough data to decide that they have a complete inventory of cases, or that case marking is fully obligatory, or that the position of case markers is not syntactically free?

Such decisions are always arbitrary, and will remain an intractable source of observation bias in a dataset comprising both little and big corpus languages. The LitTEL features are defined to avoid this issue of observation bias as much as possible.

Thus, LitTEL avoids asking 'are there any...' questions. Likewise, LitTEL features do not make reference to statistical concepts such as 'obligatoriness', nor do they presuppose knowable variation, or appeal to concepts which require such variation to establish (e.g. degree of syntactic freedom). Instead, LitTEL features first specify a construction (or multiple constructions) and then, if the construction is attested, define two possible, mutually exclusive properties of this construction. Since a construction is either attested or it is not, little languages may be more likely to contain NA values where relevant constructions are simply not attested (see below), but they are not more likely to contain a specific linguistically relevant value (0 or 1).

These desiderata are schematised below with examples.

| Potential LitTEL feature | Status | Comment |
| --- | --- | --- |
| How many cases are attested? | NO | The answer is likely to be correct for well attested languages, while figures for poorly attested languages may be incomplete in unknowable ways. |
| Is there an accusative case? | NO | In a poorly attested language, an accusative case may exist but simply not be attested, which is much less likely to be the case for a well attested language. |
| Is accusative marking obligatory? | NO | 'Obligatory' is a statistical concept which requires large datasets to reasonably establish, and may give misleading results when applied to small corpora. |
| Can the accusative marker be separated from its head noun? | NO | Smaller corpora will attest less variation and are thus less likely to provide evidence for relevant syntactic variation than larger corpora. |
| In overt accusative constructions with an attribute, does that attribute agree? | YES | If at least one relevant construction is attested, it either agrees or it does not. This is true for both poorly attested and well attested languages. |

Note that in some cases, features appeal to concepts such as lexical open-endedness or productivity. Although strictly speaking these concepts contradict the above desiderata, it is assumed that a linguist will have a sense of where these criteria apply even on limited evidence. For instance, a linguist will be aware that a locative attested only for placenames does not constitute good evidence for the existence of a locative case which could apply productively to the entire lexicon. Such a common-sense judgement can arguably be made even on the basis of a single construction.

 

Binary feature values

All features in the LitTEL feature set are binary. This means that they have two possible answers, coded as 0 (zero) and 1 (one). For instance, the first LitTEL feature describes the word order of coordinative morphs, and has the following two values:

| Value | Description |
| --- | --- |
| 0 | The coordinator is placed medially between the two coordinands (template A coord B). |
| 1 | The coordinator is placed after the second coordinand (template A B coord). |

The previous section drew attention to the issue of observation bias. LitTEL addresses this problem by avoiding the use of yes-no questions to obtain binary feature values, and instead describing both values in their own terms. Of course, this means that values must be formulated in such a way as to exclude logically possible third options in a typologically informed manner.

For instance, a logically possible third alternative for coordination, in addition to the templates [A coord B] (zero) and [A B coord] (one) shown above, is [coord A B], with the coordinative morpheme placed before the first coordinand. However, there do not seem to be any natural human languages which instantiate this construction. In this way, existing typological knowledge informs the binary value set.

Existing typological knowledge also informs the coding of the two binary values. The value instantiated by 0 (zero) is the value which, based on currently available information, is more common in the languages spoken today. For the example given above, medial coordination appears to be significantly more common in the languages of the world than postpositional coordination (Stassen 2003): consequently, medial coordination is 0 (zero) while postpositional coordination is 1 (one).
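
As a rough illustration of the scheme described in this section, a binary feature can be thought of as a construction paired with two mutually exclusive descriptions, where 0 is the description assumed to be more common cross-linguistically. The sketch below is not part of pylittel: the class name `BinaryFeature`, its fields and the constant `COORDINATOR_ORDER` are hypothetical names chosen for this example, and NA is represented by Python's `None` as an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BinaryFeature:
    """Hypothetical container: one construction, two mutually exclusive values."""

    construction: str
    value_0: str  # description assumed to be the more common one cross-linguistically
    value_1: str  # the alternative description

    def describe(self, value: Optional[int]) -> str:
        """Return the description for 0 or 1, or 'NA' for None."""
        if value is None:
            return "NA"
        if value == 0:
            return self.value_0
        if value == 1:
            return self.value_1
        raise ValueError(f"value must be 0, 1 or None, got {value!r}")


# The coordination example from the text, with both values stated in their own terms.
COORDINATOR_ORDER = BinaryFeature(
    construction="coordination with an overt coordinative morph",
    value_0="coordinator placed medially (A coord B)",
    value_1="coordinator placed after the second coordinand (A B coord)",
)

print(COORDINATOR_ORDER.describe(1))
```

Stating both values as descriptions, rather than as the answers to a yes-no question, mirrors the point made above: neither value is the mere absence of the other.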

 

Annotating competing constructions


An early Latin inscription. Photograph by Reuben Pitts.

Human language is characterised by variation. Consequently, it is likely that in many cases languages, particularly 'big' corpus languages, will attest constructions answering to both values (0 and 1) of a given feature. In such cases, a decision is made using the following criteria:

  • Is there evidence that one construction is more frequent than the other? If so, this value takes precedence when annotating the language in question. In the best case, this quantitative criterion is invoked based on corpus data.
  • Is there evidence that one construction is more pragmatically neutral than the other? If so, this value takes precedence when annotating the language in question. For instance, some languages use word order productively to change the salience of particular constituents: in such cases, the most 'neutral' word order is the most relevant for cross-linguistic comparison.
  • Is there evidence that one construction is syntactically or semantically more free than the other? If so, this value takes precedence when annotating the language in question. For instance, a construction which only coordinates semantically related coordinands (such as 'sun and moon' or 'father and mother') is dispreferred over a construction which can coordinate any constituent, regardless of its semantic properties.

In many cases, definitions will be formulated in such a way as to help the annotator discriminate between competing constructions. In particular, many definitions make reference to open-ended lexical classes. This helps to clarify that pronouns, for instance, which often have marginal constructions of their own, are less relevant to annotating values than lexical nouns.

LitTEL deliberately does not employ 'both' as an annotational value. This is because the extent of attested variation is strongly dependent on the degree and type of a language's documentation and would consequently risk introducing observation bias to the dataset. Where two constructions are attested and no choice can be made through any criterion, the annotation is NA ('not applicable'), as described in the following section.
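
The decision procedure for competing constructions can be sketched as a small function. This is only an illustration under stated assumptions: the guidelines do not say which criterion wins if two of them point in different directions, so the sketch simply consults them in the order in which they are listed above. The function name `resolve_competing` and its parameter names are hypothetical, and NA is again represented by `None`.

```python
from typing import Optional


def resolve_competing(
    more_frequent: Optional[int] = None,
    more_neutral: Optional[int] = None,
    more_free: Optional[int] = None,
) -> Optional[int]:
    """Pick a value (0 or 1) when both constructions are attested.

    Each argument is the value favoured by one criterion, or None when that
    criterion offers no evidence. Assumption: criteria are consulted in the
    order listed in the guidelines; the first decisive one wins.
    Returns None (NA) when no criterion is decisive.
    """
    for favoured in (more_frequent, more_neutral, more_free):
        if favoured is None:
            continue  # this criterion gives no evidence; try the next one
        if favoured not in (0, 1):
            raise ValueError(f"criterion must favour 0, 1 or None, got {favoured!r}")
        return favoured
    return None  # no criterion applies: annotate as NA, never as 'both'


# Only the pragmatic-neutrality criterion is decisive here.
print(resolve_competing(more_neutral=1))
```

Note that the function has no way of returning 'both': when nothing discriminates between the two constructions, the only available outcome is NA.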

 

Annotating values as NA

In addition to 0 (zero) and 1 (one), a third value, NA ('not applicable'), is available for annotation. Operations in the pylittel module ignore NA values, and NA values do not appear on the parameter pages of this web app. Consequently, an NA value effectively removes the language in question from further consideration for that feature and should be used sparingly. NA fields exist to keep track of datapoints for which an annotator could not find a satisfactory value.

NA may be used in the following cases only:

  • When the attestation of a language is insufficient to establish the correct value of a feature. Although LitTEL aims to be maximally suited for the description of poorly attested languages, it will still occasionally be the case that simply no relevant constructions are attested. Note that, all other things being equal, a single attested construction is considered sufficient to enter a value.
  • Where the language employs a device that differs from, or is incompatible with, both alternatives given. In general, the feature values should be described in a way that precludes this possibility, but exceptions are conceivable. For instance, a language may use only zero-marked coordination, in which case the position of the coordinative morph is undefined. In such a case, NA is the only meaningful value.
  • Where both alternatives exist but are either so rare or so well-balanced that the criteria in the previous section cannot distinguish between them.
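
To make concrete what it means for NA values to be ignored, the following sketch compares two annotations using only the features for which both languages have a value of 0 or 1. It does not reproduce the pylittel API: the function names, the use of `None` for NA, and the feature labels `F1` to `F4` with their values are all invented for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical annotation format: feature label -> 0, 1 or None (NA).
Annotation = Dict[str, Optional[int]]


def shared_features(a: Annotation, b: Annotation) -> List[str]:
    """Features annotated as 0 or 1 (not NA) in both languages, sorted by label."""
    return sorted(
        f for f in a.keys() & b.keys()
        if a[f] is not None and b[f] is not None
    )


def disagreement(a: Annotation, b: Annotation) -> Optional[float]:
    """Share of comparable features on which the two languages differ.

    Returns None when no feature is comparable, rather than guessing.
    """
    shared = shared_features(a, b)
    if not shared:
        return None
    differing = sum(1 for f in shared if a[f] != b[f])
    return differing / len(shared)


# Invented example data, not real annotations of any language.
language_a: Annotation = {"F1": 0, "F2": 1, "F3": None, "F4": 0}
language_b: Annotation = {"F1": 0, "F2": 0, "F3": 1, "F4": None}

print(shared_features(language_a, language_b))  # ['F1', 'F2']
print(disagreement(language_a, language_b))     # 0.5
```

Because F3 and F4 are NA in one of the two languages, they drop out of the comparison entirely, which is why NA should be used sparingly: every NA shrinks the set of features on which a language can be compared.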