Manual EN
The development of #LancsBox was supported by the Economic and Social Research Council (grant numbers
EP/P001559/1, ES/K002155/1 and ES/R008906/1).
Contents
1 Downloading and running #LancsBox X
2 Adding corpora
2.1 Visual summary: Corpus hub
2.2 My data
2.3 Web
2.4 Exporting corpora
2.5 Draft corpora
3 KWIC tool (Key Word In Context)
3.1 KWIC: An overview
3.2 Multiple panels
3.3 Metadata columns
3.4 Filters
3.5 Summary table
3.6 Working with subcorpora
4 GraphColl
4.1 GraphColl: An overview
4.2 Producing a collocation graph
4.3 Reading Collocation Tables
4.4 Reading collocation graph
4.5 Extending graph to a collocation network
4.6 Shared collocates
4.7 Problems with graphs: overpopulated graphs
4.8 Reporting collocates: CPN
5 Words tool
5.1 Words: Overview
5.2 Producing frequency lists
5.3 Producing keywords and key n-grams
5.4 Word cloud
6 Text tool
6.1 Text: Overview
7 Wizard
7.1 Wizard: An overview
7.2 R code
8 Searching in #LancsBox
9 spaCy POS tagset: English
10 spaCy dependency tags
11 CLAWS tagset (C7)
12 USAS semantic tagset
13 Definitions of smart searches
14 Glossary
Developed @ Lancaster University
#LancsBox X: License
#LancsBox is licensed under the Creative Commons BY-NC-ND licence. #LancsBox is free for non-commercial
use. The full licence is available from: http://creativecommons.org/licenses/by-nc-nd/4.0/legalcode
1 Downloading and running #LancsBox X
#LancsBox is a new-generation corpus analysis tool. Version X has been designed for 64-bit operating
systems (Windows 64-bit, Mac and Linux), which allow the tool to perform at its best.
Select and download: Select the version suitable for your operating system and download the installer to
your computer.
Or simply click on
Run installer
#LancsBox is safe to run. Double-click on the installer file and follow the steps in the installer. Always install
#LancsBox to a folder where the tool has ‘read and write’ privileges, such as the Users folder (default) or
Desktop. On Windows, never install #LancsBox to Program Files.
Windows
Please note that you may need to give the installer permission to run on your machine. On Windows,
you might be asked for an administrator password.
On Mac, click on the Apple icon > System settings > Privacy & Security.
Scroll down to Security, where you should be able to see ‘#LancsBox X Installer app’. Click on ‘Open
Anyway’.
2 Adding corpora
#LancsBox X is designed for very large corpora; it natively supports XML, which allows working with rich metadata. Data
can be imported into #LancsBox easily in a wide range of formats (txt, docx, pdf, ...). #LancsBox also has powerful web scraping
functionality.
From any tool, you can add more corpora by clicking the corpus name and selecting the “add corpora”
option from the dropdown menu.
You can:
▪ Preview a list of available corpora in Corpus hub.
▪ Download existing corpora such as the BNC2014.
▪ Load your own data under My data.
▪ Create corpora from the Web.
Tip: You can adjust the zoom level using the keyboard shortcuts Ctrl - and Ctrl + (Cmd - and Cmd
+ on a Mac).
2.2 My data
#LancsBox allows you to work with your own corpora. #LancsBox supports a wide range of file formats (txt, docx, pdf,
pptx, xlsx…) or XML.
.txt XML with w elements
We can pick up on the last comment.
Once we are in the grip of reflective
thinking it is very hard, if not impossible,
for us to see our ethical justifications of
our ethical concepts, say, in a genuine
way: we will always be drawn to the
thought that this is all local. In addition,
we will no longer see such judgements
as embodying any sort of knowledge.
2. On the ‘My data’ tab, provide information about the corpus and navigate to the data (individual files or folders
with subfolders). You can also drag and drop data into the box.
3. You can also automatically annotate (tag) the corpus for part of speech (POS), headword, grammatical relation and
semantic (USAS) category.
4. Click on ‘Load corpus’.
5. Once the corpus is loaded, click on ‘Continue’.
2.3 Web
#LancsBox allows you to easily scrape data from the web and create your own corpus.
1. On the ‘Web’ tab, provide information about the corpus you want to create (name, language).
2. Paste a list of URLs that you want to scrape at depth 1.
3. Set any additional parameters or leave the defaults.
4. Click on ‘Create corpus’.
5. Once the corpus is created, click on ‘Continue’.
2.4 Exporting corpora
#LancsBox allows you to export corpora in XML. This functionality is available for corpora with unrestricted access.
Hover your mouse over the name of a corpus and click on the ‘Export’ icon.
2.5 Draft corpora
#LancsBox allows you to pause corpus processing and return to it later. Corpora that are being processed (and
optionally tagged) are also backed up at regular points, so you can return to the last saved point should something
go wrong with the process. Incomplete corpora are available under ‘Drafts’.
To continue processing a corpus, select the appropriate corpus from the list and click on ‘Resume
corpus’.
3 KWIC tool (Key Word In Context)
The KWIC tool generates a list of all instances of a search term in a corpus in the form of a concordance.
It can be used, for example, to:
■ Find the frequency of a word or phrase in a corpus.
■ Find frequencies of different word classes such as nouns, verbs, adjectives.
■ Find complex linguistic structures such as passives and split infinitives using ‘smart searches’.
■ Sort concordance lines.
■ Compare multiple analyses side-by-side.
The following is a simple, yet efficient design of the KWIC tool. The single search box allows users to
carry out a wide variety of powerful searches.
[Figure: KWIC tool interface. Labels: Select subcorpus; Select corpus]
Click a row in a table to select it. Hold the Ctrl or Cmd key while clicking to select multiple rows. Selected
rows can be copied with the Ctrl+C / Cmd+C keyboard shortcut or by right clicking the table and
selecting the “Copy” option.
Results can also be saved easily from the main menu, where ‘Save’ or ‘Save all’ can be selected to
save the active (highlighted) panel or all panels respectively.
3.2 Multiple panels
#LancsBox X allows analyses in multiple panels. Panels can be re-arranged by clicking and dragging on
the top part of the window.
Multiple panels can be selected by holding down the Ctrl or Cmd key while clicking tools. This can be
used to perform the same search in multiple panels at once.
[Figure: multiple panels. Labels: Summary of results; Table settings; Progress bar]
3.3 Metadata columns
Efficient work with metadata is at the heart of #LancsBox X. The concordance table displays different
types of metadata. Columns can be added according to the user’s needs. These columns can be sorted
and filtered to display relevant information. To add or remove columns in a table, click on the table
settings menu and select items from the “Columns” submenu.
Add/remove columns
3.4 Filters
Powerful filters can be applied to i) linguistic and ii) metalinguistic data. Simply hover the mouse pointer
towards the right of any column header to find the filter options button.
Linguistic data can be filtered using the complete linguistic search functionality. For the left and the right
context, choose the position(s) where the required linguistic feature should occur.
Metalinguistic data can be filtered according to three data types: i) categories, ii) numbers and iii) dates.
3.5 Summary table
Data displayed as concordance lines in KWIC can also be summarised using the ‘Summary table’
functionality. Summary tables can be applied to both i) linguistic and ii) metalinguistic data.
• Linguistic summaries include the following pieces of information: i) hits (absolute frequency),
ii) number of texts in which the linguistic feature occurs and iii) break-down according to any
other available linguistic annotation, such as POS tags (pos), semantic tags (usas) or headwords (hw),
representing the linguistic feature in focus.
For example, the table above shows that at the L1 position in the concordance table the most
frequent word is the, followed by this, first, same… The word the occurs with the absolute frequency of 26,991
at the L1 position in 3,892 different texts. In this position, the is tagged with two POS tags (AT and RT42)
and 9 different semantic (USAS) tags. The details about the tags and their frequencies are revealed in
tooltips with the mouse-over functionality.
• Metadata summaries show a break-down according to a selected category. They include the
following pieces of information: i) size of the component, ii) hits (absolute frequency) in the
component, iii) relative frequency in the component, and iv) number of texts in which the
linguistic feature occurs in the component out of all texts in the component.
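As a rough illustration of how the hits and component size in a metadata summary relate to the relative frequency, the sketch below normalises hits to a common basis. The function name, the per-million basis and the example figures are assumptions made for illustration; the manual does not state which normalisation basis #LancsBox uses.

```python
def relative_frequency(hits: int, component_size: int, per: int = 1_000_000) -> float:
    """Normalise an absolute frequency to a common basis.

    The default basis (per million tokens) is an assumption for this sketch.
    """
    if component_size <= 0:
        raise ValueError("component_size must be positive")
    return hits * per / component_size


# Hypothetical component: 250 hits in a component of 2,000,000 tokens.
print(relative_frequency(250, 2_000_000))
```

Normalising in this way makes components of different sizes directly comparable, which is why a summary reports relative frequency alongside the raw hits.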
Summary tables can be copied & pasted or saved; saving will also include a break-down by individual
tags displayed in tooltips.
3.6 Working with subcorpora
#LancsBox X allows users to define subcorpora. In this way, you can restrict searches to specific parts of
a corpus. To define a new subcorpus, click the subcorpus dropdown and select the “new subcorpus”
option.
In the overlay that opens you can select the criteria for defining your subcorpus and choose a name.
Click “OK” once all criteria have been chosen. Your new subcorpus will be selected.
[Figure: defining a subcorpus. Labels: Auto filter; Date; Number]
You can change subcorpus using the subcorpus dropdown. The edit and delete buttons in the dropdown
allow you to change or remove the subcorpora you’ve defined.
4 GraphColl
The GraphColl tool identifies collocations and displays them in a table and as a collocation graph or
network.
It can be used, for example, to:
■ Find the collocates of a word or phrase.
■ Find colligations (co-occurrence of grammatical categories).
■ Visualise collocations and colligations.
■ Identify shared collocates of words or phrases.
■ Summarise discourse in terms of its ‘aboutness’.
[Figure: GraphColl interface. Labels: Settings; Collocation table]
4.2 Producing a collocation graph
GraphColl produces collocation tables and graphs on the fly. After selecting the appropriate settings you
can start searching for the node and its collocates.
4.3 Reading Collocation Tables
A collocation table is a traditional way of displaying collocates. In GraphColl, the table shows the following
pieces of information for each collocate: i) distribution, ii) collocation frequency, iii) frequency of the
collocate anywhere in the corpus and iv) all relevant statistical measures. By default, the table is sorted
(largest to smallest) according to the default collocation statistic, log Dice, and an appropriate frequency
filter is applied.
2. The meaning of the individual columns is:
i) Collocate: shows the collocate in question.
ii) Distribution: shows a bar chart indicating the textual position of the collocate (e.g. in the
L5-R5 span).
iii) Freq (coll): displays the frequency of the collocation (combination of node + collocate).
iv) Freq (corpus): displays the frequency of the collocate anywhere in the corpus.
v) Stats (names): displays the values of the selected association measures; all available
measures are computed at once. To display more or fewer measures, click on the ‘+’ button.
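To make the relationship between the frequency columns and an association measure more concrete, here is a minimal sketch of log Dice in its standard definition, 14 + log2(2 × f(node, collocate) / (f(node) + f(collocate))). Mapping f(node, collocate) to Freq (coll) and f(collocate) to Freq (corpus) is an assumption, as are the function name and the example values; the tool's own computation (for instance how it handles the span) may differ in detail.

```python
import math


def log_dice(freq_coll: int, freq_node: int, freq_collocate: int) -> float:
    """Standard log Dice score for a node-collocate pair.

    freq_coll      -- co-occurrence frequency (assumed to correspond to Freq (coll))
    freq_node      -- frequency of the node in the corpus
    freq_collocate -- frequency of the collocate in the corpus (assumed Freq (corpus))
    """
    if freq_coll <= 0 or freq_node <= 0 or freq_collocate <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * freq_coll / (freq_node + freq_collocate))


# Hypothetical values: the pair co-occurs 50 times; node and collocate occur 100 times each.
print(log_dice(50, 100, 100))
```

The theoretical maximum is 14, reached when node and collocate only ever occur together; higher values indicate a stronger association.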
4.4 Reading collocation graph
The graph displays multiple dimensions according to the table settings (right-click on the table header to
assign a graph value to a column). To find out more about a collocate, hover your mouse over it to obtain
concordance lines (KWIC preview) in which the collocate co-occurs with the node.
1. Edge length: By default, the edge (line) length is assigned to a default association measure to
express the strength of collocation. The closer the collocate is to the node, the stronger the
association between the node and the collocate (‘magnet effect’).
2. Size: The size of each collocate circle is by default assigned to the frequency of the collocation:
Freq (coll). The more frequent the collocation, the larger the circle.
3. Colour: The colour of each circle is by default assigned to the frequency of the collocate anywhere
in the corpus: Freq (corpus). The frequency range is displayed in the legend.
4. Position: The position of collocates around the node in the graph reflects the exact position of the
collocates in text: some collocates appear (predominantly) to the left of the node, others to the
right; others appear to the left and right at a similar frequency (middle position in the graph). For
ease of display, if multiple collocates appear in a similar position and overlap, the tool ‘spreads
out’ the collocates slightly.
4.5 Extending graph to a collocation network
A collocation network is an extended collocation graph that shows i) shared collocates and ii) cross-
associations between several nodes.
1. To expand a simple collocation graph into a collocation network, either search for more nodes or
left-double-click on a collocate in the graph.
2. A collocation network displays nodes with unique collocates (outer rim of the graph) and shared
collocates (middle of the graph).
[Figure: collocation network. Labels: Shared collocates in a table; Graph filter; Node]
4.6 Shared collocates
Shared collocates are collocates shared by at least two nodes in a graph. Shared collocates are displayed
in the middle of the graph with links to the relevant nodes.
1. A full list of shared collocates can be obtained by clicking on the ‘i’ icon.
2. The list of shared collocates is displayed in tabular form.
4.7 Problems with graphs: overpopulated graphs
If a collocation graph or network includes too many nodes and collocates, it becomes difficult to interpret.
This is referred to as an overpopulated graph/network. The solution is either to change the filters in the
table and make the threshold values more restrictive or to apply a filter to the graph.
The following figure shows an overpopulated graph on the left and a graph that is more easily interpretable
on the right.
4.8 Reporting collocates: CPN
It is important to realise that there is no single definitive set of collocates: different statistical procedures
and threshold values highlight different sets of collocates. We therefore need to report the statistical
choices involved in the identification of collocations using a standard notation called Collocation Parameters
Notation (CPN). When saving the results, GraphColl saves the settings in the form of CPN.
Brezina et al. (2015) propose CPN as a specific notation to be used for accurate description of collocation
procedure and replication of the results. The following parameters are reported:
5 Words tool
The Words tool allows in-depth analysis of frequencies of words, n-grams, skip-grams, grammatical and
semantic categories, as well as comparisons of corpora using the keywords technique.
[Figure: Words tool interface. Labels: Word cloud; Summary stats + Display more stats; Frequency Graph]
5.2 Producing frequency lists
When the tool is opened, Words displays a frequency list (table) based on the default corpus and default
settings. These settings can be changed easily to produce different frequency lists.
Note: Frequency lists in #LancsBox X are pre-computed and stored for
later use. If you are creating a wordlist for the first time, this might take some time
depending on the size of the corpus and the complexity of its annotation (number of units
used).
5.3 Producing keywords and key n-grams
The Words module computes a comparison of frequencies between two corpora/wordlists using a
selected statistical measure.
1. Click on the key icon at the top right corner of the table.
2. Select the appropriate reference corpus.
3. Sort and/or filter according to your preferred keyword statistic (Simple maths is used by default
for sorting).
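For readers who want to see what a ‘simple maths’ comparison amounts to, the following sketch follows Kilgarriff's (2009) formulation: the ratio of relative frequencies in the focus and reference corpora, with a constant added to both to avoid division by zero. The function name, the per-million basis, the default constant of 1 and the example figures are assumptions for illustration, not a description of the exact settings in #LancsBox.

```python
def simple_maths(freq_focus: int, size_focus: int,
                 freq_ref: int, size_ref: int,
                 k: float = 1.0, per: int = 1_000_000) -> float:
    """Keyness score as a smoothed ratio of relative frequencies.

    k is the smoothing constant (assumed default: 1); per is the
    normalisation basis (assumed: per million tokens).
    """
    if size_focus <= 0 or size_ref <= 0:
        raise ValueError("corpus sizes must be positive")
    rel_focus = freq_focus * per / size_focus
    rel_ref = freq_ref * per / size_ref
    return (rel_focus + k) / (rel_ref + k)


# Hypothetical word: 300 hits in a 1M-token focus corpus, 99 hits in a 1M-token reference corpus.
score = simple_maths(300, 1_000_000, 99, 1_000_000)
print(round(score, 2))
```

A score above 1 indicates that the word is relatively more frequent in the focus corpus; raising the constant shifts attention towards higher-frequency words.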
5.4 Word cloud
The Words module creates word clouds based on words, n-grams, grammatical and semantic structures. Word
clouds can be assigned different statistical properties from the table, indicated by i) position, ii) font size and iii)
colour in the graph.
Did you know?
The statistical technique of keyword analysis was originally developed by Mike Scott (1997) and
it was implemented in WordSmith Tools. It relied on corpus comparison using the chi-squared
test or the log-likelihood test. As Kilgarriff pointed out, the chi-squared test and the log-likelihood
test are not entirely appropriate for this type of comparison. Kilgarriff’s solution implemented in
Sketch Engine was to compare corpora using a ‘simple maths’ procedure, a simple ratio between
relative frequencies of words in the two corpora we compare. In addition to ‘simple maths’,
#LancsBox also offers other types of solutions for corpus comparison.
Scott, M. (1997). PC analysis of key words—and key key words. System, 25(2), 233-245.
Kilgarriff, A. (2009, July). Simple maths for keywords. In Proceedings of the Corpus Linguistics Conference. Liverpool, UK.
6 Text tool
The Text tool provides an overview of all files (texts) in the corpus, their size and lexical diversity. It also
allows in-depth analysis of individual texts in the full view mode. The tool also searches texts and offers an
overview table with a breakdown of frequencies and relative frequencies per file, and it highlights
search terms in individual texts.
[Figure: Text tool. Left: Overview table or full text view. Right: Visualizing corpus files. Labels: View annotation; Graph filter; Right-click: assign value relevant to graph; Left-click: sort; Graph showing individual files; Summary stats; Relative frequencies visualized (colour)]
7 Wizard
The Wizard tool allows batch searching of corpora and running statistical analyses on the results. Wizard
implements the R package, which can be used to run simple and complex statistical analyses inside #LancsBox.
To start the Wizard tool, click on the Wizard icon in the top right corner of the search bar.
[Figure: Wizard Data tab. Labels: Select tool; Choose settings]
All search processes from the Data tab run in the background and are displayed in the bottom
right corner of the tool. Progress is indicated by a blue circle; running searches can be
cancelled by clicking on the cancel button into which the icon turns on mouse over.
[Figure: Wizard Processing tab. Labels: Preview/edit R code; R textual output; R visual output]
7.2 R code
1. To refer to individual tables from previewed Tables, use tables[[1]], tables[[2]], tables[[3]]
3. To request user input use ‘readline’; provide input and press enter.
4. To load an R library, use ‘library(name_of_library)’; not all libraries are currently supported.
For a list of supported libraries (please read the small print on the functionality within those
libraries), please see: https://packages.renjin.org/packages. Much of the functionality from
individual libraries, if not currently supported, can be taken over by core R functions, which are
always available.
8 Searching in #LancsBox
#LancsBox offers powerful searches at different levels of corpus annotation using i) simple searches, ii)
wildcard searches, iii) smart searches, iv) CQL searches.
1. Simple searches are literal searches for a particular word (new) or phrase (New York Times). Simple
searches are case insensitive; this means that new, New, NEW, NeW etc. will return the same set
of results.
2. Wildcard searches are searches that include the asterisk * as a special character.
Special character   Meaning                  Example of use
*                   0 or more characters     new* [new, news, newly, newspaper…]
                    any word [with space]    new * [new car, New York, new ideas…]
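One way to picture the two uses of the asterisk is to translate a wildcard query into a regular expression. The sketch below is only an illustration of the behaviour described in the table, not the tool's implementation; the function name and the choice of `\S` to stand for "a character within a word" are assumptions.

```python
import re


def wildcard_to_regex(query: str) -> re.Pattern:
    """Translate a wildcard query into a case-insensitive regular expression."""
    parts = []
    for term in query.split():
        if term == "*":
            # A standalone asterisk stands for any whole word.
            parts.append(r"\S+")
        else:
            # An asterisk attached to a word stands for 0 or more characters.
            parts.append(r"\S*".join(re.escape(piece) for piece in term.split("*")))
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


print(bool(wildcard_to_regex("new*").fullmatch("newspaper")))
print(bool(wildcard_to_regex("new *").fullmatch("New York")))
```

Because the pattern is compiled with `re.IGNORECASE`, it mirrors the case-insensitive behaviour of simple searches.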
3. Punctuation searches:
To search for punctuation use forward slashes as in the examples below.
/?/
hello /,/
4. Smart searches are searches predefined in the tool to offer users easy access to complex searches;
smart searches are unique to #LancsBox. These searches are used for searching for word classes
(NOUN, VERB etc.), complex grammatical patterns (PASSIVE, SPLIT_INFINITIVE etc.) and semantic
categories (PLACE_ADVERB).
NEGATION
NOMINALIZATION
NOUN
NUMBER
PARTICLE
PASSIVE
PAST_PARTICIPLE
PAST_TENSE
PEOPLE
PERFECT_INFINITIVE
PHRASAL_VERB
PLACE_ADVERB
PLANET
PREPOSITIONAL_PHRASE
PRESENT_TENSE
PRONOUN
PROPER_NOUN
REFLEXIVE_PRONOUN
SHORT_WORD
SPLIT_INFINITIVE
SUPERLATIVE
SUPERNATURAL
SWEARWORDS
TECHNOLOGY
TIME
TIME_ADVERB
VERB
5. CQL (Corpus Query Language) searches. #LancsBox supports powerful searches using CQL.
These can be used for defining complex searches at different levels of annotation.
The levels of annotation and syntax depend on the tagging of the corpus, but for XML corpora it is
common to have i) word, ii) headword/lemma (hw), iii) part-of-speech (POS), and iv) a user-defined
tag. For example, a single token can be searched in CQL with
[word="goes" hw="go" pos="V.*" usas="M1"]
This will match every instance of the word goes with the headword go, the part-of-speech tag V.*
(verb) and the usas tag M1 (Moving, coming and going). If a level of annotation is not specified, no
restriction is applied at that level. Everything in double quotes is interpreted as a case insensitive
regular expression.
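The token-level matching described above can be sketched in a few lines. This is an illustrative model only: representing a token as a dictionary of annotation levels, the function name, and the assumption that a pattern must match the whole value (rather than part of it) are all choices made for this sketch.

```python
import re


def token_matches(token: dict, constraints: dict) -> bool:
    """Check a token against per-level regular expressions.

    Levels that are not mentioned in `constraints` are unrestricted;
    patterns are treated as case-insensitive, as described in the text.
    """
    for level, pattern in constraints.items():
        value = token.get(level)
        if value is None or re.fullmatch(pattern, value, re.IGNORECASE) is None:
            return False
    return True


# Hypothetical annotated token.
token = {"word": "goes", "hw": "go", "pos": "VVZ", "usas": "M1"}

print(token_matches(token, {"hw": "go", "pos": "V.*"}))
print(token_matches(token, {"pos": "N.*"}))
```

Leaving a level out of the constraints corresponds to not specifying that level in the CQL token, so no restriction is applied there.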
To make queries case sensitive use double equals as in the example below:
[word=="US"]
To make negative searches use a combination of an exclamation mark and the equals sign, which
means ‘is not equal to’ as in the example below:
[word!="new"]
To search for punctuation, use forward slashes and the attribute punc as in the example below. Note
that special characters such as the question mark or the full stop need to be escaped by using the
backslash symbol \
/punc="\?|\.|,|;"/
Multiple tokens can be placed in sequence. An empty pair of square brackets [] will match any token.
Tokens can be repeated X times using the syntax {X}, and repeated anywhere between Y and Z times
using the syntax {Y, Z}. The shorthand for {0, 1} is a question mark. Thus, for instance, the following
CQL expression
[pos="VB.*"] []{0,3} [pos="V.N"]?
is interpreted as a verb to be (VB.*) followed by between 0 and 3 tokens without restriction ([]{0,3})
and optionally followed by the past participle (V.N).
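The effect of quantifiers on a token sequence (a form of be, up to three unrestricted tokens, then an optional past participle) can be modelled with a small backtracking matcher. Everything here is a simplified sketch under stated assumptions: tokens are dictionaries of annotation levels, each query step is a hypothetical `(constraints, min, max)` triple, and repetition is greedy; it does not reproduce how #LancsBox evaluates CQL internally.

```python
import re
from typing import Optional

# One query step: (constraints per annotation level, min repeats, max repeats).
# An empty constraints dict plays the role of [] and matches any token.
Step = tuple[dict, int, int]


def _token_ok(token: dict, constraints: dict) -> bool:
    return all(
        re.fullmatch(pattern, token.get(level, ""), re.IGNORECASE)
        for level, pattern in constraints.items()
    )


def match_at(tokens: list, start: int, steps: list) -> Optional[int]:
    """Return the end index of a match beginning at `start`, or None."""
    if not steps:
        return start
    constraints, lo, hi = steps[0]
    # Count how many consecutive tokens satisfy this step (at most `hi`).
    n = 0
    while n < hi and start + n < len(tokens) and _token_ok(tokens[start + n], constraints):
        n += 1
    # Greedy: try the most repetitions first, then back off towards `lo`.
    for count in range(n, lo - 1, -1):
        end = match_at(tokens, start + count, steps[1:])
        if end is not None:
            return end
    return None


# Hypothetical tagged sentence fragment: "was very quickly taken away".
tokens = [
    {"word": "was", "pos": "VBDZ"},
    {"word": "very", "pos": "RG"},
    {"word": "quickly", "pos": "RR"},
    {"word": "taken", "pos": "VVN"},
    {"word": "away", "pos": "RL"},
]
steps = [({"pos": "VB.*"}, 1, 1), ({}, 0, 3), ({"pos": "V.N"}, 0, 1)]

print(match_at(tokens, 0, steps))
```

Because the final step is optional, the query also matches a form of be on its own; this is worth bearing in mind when interpreting hit counts for queries that end in an optional token.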
Parts of a query can also be wrapped in parentheses (), allowing a quantifier such as {1,2} to apply to
a sequence of tokens, e.g. ([pos="N.*"] [word="and"]){2}. Words, phrases and smart searches can be
used anywhere CQL tokens can, e.g. very{2} ADJECTIVE{1,2} [hw="year"].
CQL also supports searching XML structure. The query <u/> matches every <u></u> element, representing
utterances. The following matches every utterance where the n attribute is 1 and the nationality
attribute is British or American:
These element queries can be combined with the other types of queries using the within syntax:
[pos="D.*"] green NOUN within <text genre="newspapers"/>
This query matches every instance of a determiner followed by “green” followed by a noun within
newspaper texts. The left and right hand sides of the within query can be anything; they can also be
other within queries:
(<emoji/> within please) within (<e/> within <text genre="elanguage"/>)
9 spaCy POS tagset: English
JJ adjective TO infinitival to
10 spaCy dependency tags
11 CLAWS tagset (C7)
Source: http://ucrel.lancs.ac.uk/claws7tags.html
JJ general adjective
JJR general comparative adjective (e.g. older, better, stronger)
JJT general superlative adjective (e.g. oldest, best, strongest)
JK catenative adjective (able in be able to, willing in be willing to)
MC cardinal number, neutral for number (two, three..)
MC1 singular cardinal number (one)
MC2 plural cardinal number (e.g. sixes, sevens)
MCGE genitive cardinal number, neutral for number (two's, 100's)
MCMC hyphenated number (40-50, 1770-1827)
MD ordinal number (e.g. first, second, next, last)
MF fraction, neutral for number (e.g. quarters, two-thirds)
ND1 singular noun of direction (e.g. north, southeast)
NN common noun, neutral for number (e.g. sheep, cod, headquarters)
NN1 singular common noun (e.g. book, girl)
NN2 plural common noun (e.g. books, girls)
NNA following noun of title (e.g. M.A.)
NNB preceding noun of title (e.g. Mr., Prof.)
NNL1 singular locative noun (e.g. Island, Street)
NNL2 plural locative noun (e.g. Islands, Streets)
NNO numeral noun, neutral for number (e.g. dozen, hundred)
NNO2 numeral noun, plural (e.g. hundreds, thousands)
NNT1 temporal noun, singular (e.g. day, week, year)
NNT2 temporal noun, plural (e.g. days, weeks, years)
NNU unit of measurement, neutral for number (e.g. in, cc)
NNU1 singular unit of measurement (e.g. inch, centimetre)
NNU2 plural unit of measurement (e.g. ins., feet)
NP proper noun, neutral for number (e.g. IBM, Andes)
NP1 singular proper noun (e.g. London, Jane, Frederick)
NP2 plural proper noun (e.g. Browns, Reagans, Koreas)
NPD1 singular weekday noun (e.g. Sunday)
NPD2 plural weekday noun (e.g. Sundays)
NPM1 singular month noun (e.g. October)
NPM2 plural month noun (e.g. Octobers)
PN indefinite pronoun, neutral for number (none)
PN1 indefinite pronoun, singular (e.g. anyone, everything, nobody, one)
PNQO objective wh-pronoun (whom)
PNQS subjective wh-pronoun (who)
PNQV wh-ever pronoun (whoever)
PNX1 reflexive indefinite pronoun (oneself)
PPGE nominal possessive personal pronoun (e.g. mine, yours)
PPH1 3rd person sing. neuter personal pronoun (it)
PPHO1 3rd person sing. objective personal pronoun (him, her)
PPHO2 3rd person plural objective personal pronoun (them)
PPHS1 3rd person sing. subjective personal pronoun (he, she)
PPHS2 3rd person plural subjective personal pronoun (they)
PPIO1 1st person sing. objective personal pronoun (me)
PPIO2 1st person plural objective personal pronoun (us)
PPIS1 1st person sing. subjective personal pronoun (I)
PPIS2 1st person plural subjective personal pronoun (we)
PPX1 singular reflexive personal pronoun (e.g. yourself, itself)
PPX2 plural reflexive personal pronoun (e.g. yourselves, themselves)
PPY 2nd person personal pronoun (you)
RA adverb, after nominal head (e.g. else, galore)
REX adverb introducing appositional constructions (namely, e.g.)
RG degree adverb (very, so, too)
RGQ wh- degree adverb (how)
RGQV wh-ever degree adverb (however)
RGR comparative degree adverb (more, less)
RGT superlative degree adverb (most, least)
RL locative adverb (e.g. alongside, forward)
RP prep. adverb, particle (e.g. about, in)
RPK prep. adv., catenative (about in be about to)
RR general adverb
RRQ wh- general adverb (where, when, why, how)
RRQV wh-ever general adverb (wherever, whenever)
RRR comparative general adverb (e.g. better, longer)
RRT superlative general adverb (e.g. best, longest)
RT quasi-nominal adverb of time (e.g. now, tomorrow)
TO infinitive marker (to)
UH interjection (e.g. oh, yes, um)
VB0 be, base form (finite i.e. imperative, subjunctive)
VBDR were
VBDZ was
VBG being
VBI be, infinitive (To be or not... It will be ..)
VBM am
VBN been
VBR are
VBZ is
VD0 do, base form (finite)
VDD did
VDG doing
VDI do, infinitive (I may do... To do...)
VDN done
VDZ does
VH0 have, base form (finite)
VHD had (past tense)
VHG having
VHI have, infinitive
VHN had (past participle)
VHZ has
VM modal auxiliary (can, will, would, etc.)
VMK modal catenative (ought, used)
VV0 base form of lexical verb (e.g. give, work)
VVD past tense of lexical verb (e.g. gave, worked)
VVG -ing participle of lexical verb (e.g. giving, working)
VVGK -ing participle catenative (going in be going to)
VVI infinitive (e.g. to give... It will work...)
VVN past participle of lexical verb (e.g. given, worked)
VVNK past participle catenative (e.g. bound in be bound to)
VVZ -s form of lexical verb (e.g. gives, works)
XX not, n't
ZZ1 singular letter of the alphabet (e.g. A,b)
ZZ2 plural letter of the alphabet (e.g. A's, b's)
12 USAS semantic tagset
Source: http://ucrel.lancs.ac.uk/usas
I3.1 Work and employment: Generally
I3.2 Work and employment: Professionalism
I4 Industry
K1 Entertainment generally
K2 Music and related activities
K3 Recorded sound etc.
K4 Drama, the theatre and showbusiness
K5 Sports and games generally
K5.1 Sports
K5.2 Games
K6 Childrens games and toys
L1 Life and living things
L2 Living creatures generally
L3 Plants
M1 Moving, coming and going
M2 Putting, taking, pulling, pushing, transporting &c.
M3 Vehicles and transport on land
M4 Shipping, swimming etc.
M5 Aircraft and flying
M6 Location and direction
M7 Places
M8 Remaining/stationary
N1 Numbers
N2 Mathematics
N3 Measurement
N3.1 Measurement: General
N3.2 Measurement: Size
N3.3 Measurement: Distance
N3.4 Measurement: Volume
N3.5 Measurement: Weight
N3.6 Measurement: Area
N3.7 Measurement: Length & height
N3.8 Measurement: Speed
N4 Linear order
N5 Quantities
N5.1 Entirety; maximum
N5.2 Exceeding; waste
N6 Frequency etc.
O1 Substances and materials generally
O1.1 Substances and materials generally: Solid
O1.2 Substances and materials generally: Liquid
O1.3 Substances and materials generally: Gas
O2 Objects generally
O3 Electricity and electrical equipment
O4 Physical attributes
O4.1 General appearance and physical properties
O4.2 Judgement of appearance (pretty etc.)
O4.3 Colour and colour patterns
O4.4 Shape
O4.5 Texture
O4.6 Temperature
P1 Education in general
Q1 LINGUISTIC ACTIONS, STATES AND PROCESSES; COMMUNICATION
Q1.1 LINGUISTIC ACTIONS, STATES AND PROCESSES; COMMUNICATION
Q1.2 Paper documents and writing
Q1.3 Telecommunications
Q2 Speech acts
Q2.1 Speech etc:- Communicative
Q2.2 Speech acts
Q3 Language, speech and grammar
Q4 The Media
Q4.1 The Media:- Books
Q4.2 The Media:- Newspapers etc.
Q4.3 The Media:- TV, Radio and Cinema
S1 SOCIAL ACTIONS, STATES AND PROCESSES
S1.1 SOCIAL ACTIONS, STATES AND PROCESSES
S1.1.1 SOCIAL ACTIONS, STATES AND PROCESSES
S1.1.2 Reciprocity
S1.1.3 Participation
S1.1.4 Deserve etc.
S1.2 Personality traits
S1.2.1 Approachability and Friendliness
S1.2.2 Avarice
S1.2.3 Egoism
S1.2.4 Politeness
S1.2.5 Toughness; strong/weak
S1.2.6 Sensible
S2 People
S2.1 People:- Female
S2.2 People:- Male
S3 Relationship
S3.1 Relationship: General
S3.2 Relationship: Intimate/sexual
S4 Kin
S5 Groups and affiliation
S6 Obligation and necessity
S7 Power relationship
S7.1 Power, organizing
S7.2 Respect
S7.3 Competition
S7.4 Permission
S8 Helping/hindering
S9 Religion and the supernatural
T1 Time
T1.1 Time: General
T1.1.1 Time: General: Past
T1.1.2 Time: General: Present; simultaneous
T1.1.3 Time: General: Future
T1.2 Time: Momentary
T1.3 Time: Period
T2 Time: Beginning and ending
T3 Time: Old, new and young; age
T4 Time: Early/late
W1 The universe
W2 Light
W3 Geographical terms
W4 Weather
W5 Green issues
X1 PSYCHOLOGICAL ACTIONS, STATES AND PROCESSES
X2 Mental actions and processes
X2.1 Thought, belief
X2.2 Knowledge
X2.3 Learn
X2.4 Investigate, examine, test, search
X2.5 Understand
X2.6 Expect
X3 Sensory
X3.1 Sensory:- Taste
X3.2 Sensory:- Sound
X3.3 Sensory:- Touch
X3.4 Sensory:- Sight
X3.5 Sensory:- Smell
X4 Mental object
X4.1 Mental object:- Conceptual object
X4.2 Mental object:- Means, method
X5 Attention
X5.1 Attention
X5.2 Interest/boredom/excited/energetic
X6 Deciding
X7 Wanting; planning; choosing
X8 Trying
X9 Ability
X9.1 Ability:- Ability, intelligence
X9.2 Ability:- Success and failure
Y1 Science and technology in general
Y2 Information technology and computing
Z0 Unmatched proper noun
Z1 Personal names
Z2 Geographical names
Z3 Other proper names
Z4 Discourse Bin
Z5 Grammatical bin
Z6 Negative
Z7 If
Z8 Pronouns etc.
Z9 Trash can
Z99 Unmatched
13 Definitions of smart searches
ADJECTIVE [pos="J.*"]
ADVERB [pos="R.*"]
BE [pos="VB.*"]
BOOSTER [hw="absolutely|altogether|completely|enormously|entirely|extremely|fully|greatly|highly|intensely|perfectly|strongly|thoroughly|totally|utterly|very"]
COLLECTIVE_NOUN [hw="a" pos="D.*"][hw="aerie|album|ambush|anthology|archipelago|argument|argumentation|armada|army|array|arsenal|ascension|assembly|aurora|badelynge|bag|bale|band|bank|banner|barrel|barren|bask|basket|batch|battery|bazaar|bed|bellowing|belt|bench|bevy|bew|bill|bind|bits|blessing|bloat|block|blush|board|bob|body|boil|boll|bond|book|bouquet|bowl|brace|branch|brew|brigade|brood|bubble|budget|building|bunch|bundle|bury|business|cache|canteen|caravan|cartload|cast|caste|catalogue|catch|cavalcade|celebration|cete|chain|charm|chatter|chattering|chest|chine|choir|chorus|circle|circus|clamour|clan|clash|clashing|class|clattering|clew|clique|cloud|clowder|cluck|clump|cluster|clutch|clutter|coalition|coil|collection|colony|column|comb|commonwealth|communion|community|company|compendium|confab|conflagration|confraternity|confusion|congregation|congress|conspiracy|constellation|converting|convocation|convoy|copse|cornucopia|corps|cortege|cost|cote|coterie|coven|cover|covert|covey|cowardice|cran|crash|crate|creche|crew|crop|crowd|cry|culture|death|deceit|deck|den|descent|desert|destruction|dicker|disguising|dissimulation|diving|division|doading|dole|dopping|dout|down|doyft|draft|draught|dray|drift|dropping|drove|drum|dule|durante|dynasty|earth|eleven|embarrassment|equivocation|erst|escargatoire|exaltation|faculty|faggot|fall|family|farrow|fellowship|fesnying|fesnyng|festival|fesynes|fidget|field|fine|fitting|fixie|flange|flap|fleet|flick|flight|fling|flink|float|flock|flotilla|flourish|flush|fluther|flutter|fold|forest|fraunch|fun|gaggle|galaxy|gam|gang|garland|garrison|gathering|gatling|gaze|generation|giggle|glaring|gleam|glide|glint|glitter|glory|glossary|grist|group|grove|gulp|hail|hand|haras|harem|harvest|haul|head|heap|heard|hedge|herd|hill|hive|holiness|horde|host|house|hover|huddle|hunt|hurtle|husk|illusion|implausibility|index|infestation|intrusion|invention|kaleidoscope|kendle|kennel|kettle|kindle|kine|kingdom|knab|knob|knot|labour|lamentation|layer|lead|leap|leash|lepe|library|line|list|litter|lodge|loft|lounge|loveliness|machination|malapertness|marvel|mask|mass|match|melody|memory|menagerie|mess|mews|miller|mischief|mob|mouthful|movement|multiply|murder|murmuration|muscle|muster|mustering|mutation|mute|necklace|nest|neverthriving|nide|nosegay|nuisance|number|nursery|nye|obesiance|observance|obstinacy|orchard|orchestra|ostentation|outfit|pace|pack|packet|paddling|pair|panel|panes|pantheon|parade|parcel|parel|park|parliament|party|passel|patrol|peal|peep|pencil|piddle|pile|pint|pit|piteousness|pitying|plague|platoon|plump|pocket|pod|ponder|pontification|pool|posse|pounce|poverty|prattle|prettying|prickle|pride|prudence|puddling|pump|punnet|purse|quabble|quarrel|quire|quiver|rabble|radiance|raffle|raft|rafter|rag|rainbow|rake|rangale|range|rayful|ream|reel|regiment|rhumba|richesse|ring|roll|romp|rookery|roost|rope|rouleau|round|rout|route|row|royalty|rumble|rump|rumpus|run|rush|salvo|sarcasm|sault|scatter|school|scold|scorn|scourge|screech|scurry|sea|sect|sedge|sequitur|series|serving|set|setting|sheaf|shelf|shimmer|shitload|shoal|shower|shrewdness|shuffle|siege|singular|sizzle|skein|skirl|skulk|slate|sleuth|slew|slither|sloth|smack|snarl|snatch|sneak|sord|sounder|soviet|sowse|span|spawn|spinney|spring|sprinkle|squad|squadron|stable|stack|staff|stage|stalk|stand|staple|stare|state|stench|stick|stock|storytelling|streak|stream|string|stud|suit|suite|superfluity|sute|swarm|swirl|tassel|team|tenement|thought|threatening|thunder|tiding|tittering|toil|tok|torment|totter|tower|trace|train|trembling|tribe|trimming|trip|troop|troubling|troupe|truss|tuft|tumult|turn|ubiquity|unkindness|venue|vineyard|volery|wad|waddle|wake|walk|warren|watch|wealth|wedge|weyr|wheel|whiteness|whoop|wing|wisdom|wisp|wolfpack|wrack|wreath|yap|yoke|zap|zeal|zoo"][hw="of"][pos="NN.*"]{1,2}
COMPARATIVE [pos="JJR|RGR|RRR"]
COMPLEX_NOUN_PHRASE [pos="J.*"]{1,5}[pos="NN.*"]
CONDITIONAL [hw="if|unless"]
CONNECTOR [pos="I.*|CS|CC"]
CONTRACTION [][word="'(s|re|ve|d|m|em|ll)|n't" pos="[^G].*"]
DEGREE_ADVERB [hw="very|really|too|quite|exactly|right|pretty|real|more|relatively" pos="R.*"]
DETERMINER [pos="D.*"]
DO [hw="do" pos="VV.*"]
DOWNTONER [hw="almost|barely|hardly|merely|mildly|nearly|only|partially|partly|practically|scarcely|slightly|somewhat"]
EXISTENTIAL_THERE [pos="EX"]
GERUND [hw="(?!(.*thing|evening|morning|viking)).{2,}ing" pos="NN[12]"]
HAVE [pos="VH.*"]
HYPHENATED_WORD [word=".*-.*"]
INDEFINITE_PRONOUN [hw="anybody|anyone|anything|everybody|everyone|everything|nobody|none|nothing|nowhere|somebody|someone|something"]
INFINITIVE [pos="TO"][pos="V.*"]
INTERJECTION [pos="UH"]
LINKING_ADVERB [hw="then|so|anyway|though|however|e\.?g\.?|i\.?e\.?|therefore|thus|nevertheless|nonetheless" pos="R.*"]
LONG_WORD [word=".{15,}"]
MODAL [pos="MD"]
NEGATION [word="not|.*n't|no|neither|nowhere|never|nor|none|nobody|nothing"]
NOMINALIZATION [word=".{3,}(tion|tions|ment|ments|ness|nesses|ity|ities)"]
NOUN [pos="N.*"]
NUMBER [pos="M.*"]
PARTICLE [pos="RP"]
PASSIVE [pos="VB[^0].*"][pos="R.*"]{0,3}[pos="V.N"]
PAST_TENSE [pos="V.D.?"]
PAST_PARTICIPLE [pos="V.N"]
PERFECT_INFINITIVE [pos="TO"][pos="VH.*"][pos="V.N"]
PHRASAL_VERB [pos="VV."][pos="PP.*"]{0,1}[pos="RP"]
PLACE_ADVERB [hw="aboard|above|abroad|across|ahead|alongside|around|ashore|astern|away|behind|below|beneath|beside|downhill|downstairs|downstream|east|far|hereabouts|indoors|inland|inshore|inside|locally|near|nearby|north|nowhere|outdoors|outside|overboard|overland|overseas|south|underfoot|underneath|uphill|upstairs|upstream|west"]
PREPOSITIONAL_PHRASE [pos="I.*|CS"][pos="J.*|PP.*|CC|D.*|RR|M.*|GE|N.*"]{0,5}[pos="N.*"]
PRESENT_PARTICIPLE [pos="V.GK?"]
PRESENT_TENSE [pos="V.Z"]
PRONOUN [pos="P.*"]
PROPER_NOUN [pos="NP.*"]
REFLEXIVE_PRONOUN [hw=".*sel(f|ves)" pos="P.X."]
SHORT_WORD [word=".{1,3}"]
SPLIT_INFINITIVE [pos="TO"][pos="R.*"][pos="V.*"]
SUPERLATIVE [pos="DAT|JJT|RGT|RRT"]
SWEARWORDS [hw="arse|arsehole|bastard|bellend|bint|bitch|bloodclaat|bloody|bollocks|bugger|bullshit|clunge|cock|crap|cunt|damn|dick|dickhead|fanny|feck|fuck.*|gash|git|god|goddam|jesus|minge|minger|motherfucker|munter|piss|prick|punani|pussy|shit|sod|tit|twat"]
TIME_ADVERB [hw="afterwards?|again|earlier|early|eventually|formerly|immediately|initially|instantly|late|lately|later|momentarily|now|nowadays|once|originally|presently|previously|recently|shortly|simultaneously|soon|subsequently|today|tomorrow|tonight|yesterday"]
VERB [pos="V.*"]
PEOPLE [sem="S2|S2:1|S2:2|S3|S3:1|S3:2|S4"]
MALE [sem="S2:2"]
FEMALE [sem="S2:1"]
SUPERNATURAL [sem="S9"]
EMOTION [sem="E|E1|E2|E3|E4|E4:1|E4:2|E5|E6"]
TIME [sem="T1|T1:1|T1:1:1|T1:1:2|T1:2|T1:3|T2|T3|T4"]
PLANET [sem="W1|W2|W3|W4|W5|L1|L2|L3"]
COLOR [sem="O4:3"]
COLOUR [sem="O4:3"]
BODY [sem="B1|B2|B3"]
FOOD [sem="F1|F2"]
TECHNOLOGY [sem="Y1|Y2"]
MEDIA [sem="Q4|Q4:1|Q4:2|Q4:3|K1|K2|K3|K4"]
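Several definitions in this table constrain only the word form, so they can be tried out as ordinary regular expressions. The Python sketch below applies five of them to single tokens. It rests on two assumptions that this manual does not state: that a pattern must match the whole token (hence `re.fullmatch`) and that matching is case-sensitive. The helper `smart_search_labels` is hypothetical and is not part of #LancsBox X.

```python
import re

# Word-form patterns copied from the smart-search definitions in this section.
# Assumption: each pattern has to match the entire token, not a substring.
WORD_PATTERNS = {
    "HYPHENATED_WORD": r".*-.*",
    "LONG_WORD": r".{15,}",
    "NEGATION": r"not|.*n't|no|neither|nowhere|never|nor|none|nobody|nothing",
    "NOMINALIZATION": r".{3,}(tion|tions|ment|ments|ness|nesses|ity|ities)",
    "SHORT_WORD": r".{1,3}",
}


def smart_search_labels(word):
    """Return the sorted names of all word-level patterns that match `word`."""
    return sorted(
        name
        for name, pattern in WORD_PATTERNS.items()
        if re.fullmatch(pattern, word)
    )


for token in ["information", "nation", "don't", "part-of-speech", "not"]:
    print(token, smart_search_labels(token))
```

Under these assumptions 'nation' is not returned as a nominalization, because the pattern requires at least three characters before the suffix. Definitions that also carry `pos` or `sem` constraints, or that span several tokens, need tagged input and are not covered by this sketch.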
14 Glossary
Absolute (or raw) frequency – The number of times a linguistic feature occurs in a corpus or its part(s);
the number of hits of a search query in a corpus.
Colligation – Systematic co-occurrence of grammatical categories (e.g. POS tags) in text identified
statistically.
Collocate – A word that systematically occurs with the node (word or phrase of interest, search term).
Concordance line – A single line in the KWIC table, usually containing the node (search match) and
several words before and after it (the left and right context).
Concordance – A typical form of display for examples of language use found in a corpus, with the node
(search match) in the middle and several words of context displayed on the left and right. A concordance
is sometimes also called a 'KWIC (display)'.
Corpus (pl. corpora) – A collection of language data that can be searched by a computer.
Frequency – The number of times a search query matches text in the corpus. A distinction is made
between absolute (simple number of hits) and relative frequency (number of hits per X number of
words).
KWIC – An abbreviation for 'keyword in context'. This is a typical form of display for examples found in a
corpus with the node (word or phrase of interest) in the middle and several words of context displayed
on the left and right. KWIC is sometimes also called a 'concordance'.
Left context – The words preceding a particular search match (node). Individual positions in the left
context are referred to as L1 (position immediately preceding), L2, L3 etc.
Lemma / Headword – All inflected forms belonging to one stem. For example, the lemma ‘go’ includes the
following word forms (types): ‘go’, ‘goes’, ‘went’, ‘going’ and ‘gone’.
Node – The word, phrase or grammatical structure of interest; the text matching a search query.
Part-of-speech tagging (POS tagging) – A process of adding information about the grammatical category
of each word in a text or corpus. For example, the following sentence was POS-tagged: Automatically_RB
annotates_VBZ data_NNS for_IN part-of-speech_NN.
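As a rough illustration of how such tagged text can be processed, the Python sketch below splits the example sentence into word and tag pairs. It assumes the `word_TAG` underscore notation used in the example and leaves out the sentence-final full stop; `parse_tagged` is a hypothetical helper, not a #LancsBox X function.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last underscore so that words containing "_" stay intact.
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


tagged = "Automatically_RB annotates_VBZ data_NNS for_IN part-of-speech_NN"
pairs = parse_tagged(tagged)
print(pairs)

# Keep only the words whose tag starts with "NN" (the noun tags in this example).
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)
```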
Regular expressions (regex) – A special meta-language that allows advanced users to search for many
strings simultaneously.
Relative (or normalized) frequency (RF) – The absolute frequency of a search query divided by the total
number of words searched (the number of words in the corpus or subcorpus). This number is usually
multiplied by an appropriate basis for normalization (e.g. 10,000).
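The calculation can be sketched in a few lines of Python. The function name and the figures (250 hits in a corpus of 1,000,000 words) are invented for illustration only.

```python
def relative_frequency(absolute_frequency, corpus_size, basis=10_000):
    """Return hits per `basis` words: absolute frequency / corpus size * basis."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Multiply before dividing to keep the arithmetic exact for round numbers.
    return absolute_frequency * basis / corpus_size


# Hypothetical figures: 250 hits in 1,000,000 words, normalized per 10,000 words.
rf = relative_frequency(250, 1_000_000)
print(rf)  # 2.5
```

Normalizing in this way makes frequencies comparable across corpora or subcorpora of different sizes.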
Right context – The words following a particular search match (node). Individual positions in the right
context are referred to as R1 (position immediately following), R2, R3 etc.
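The L and R position labels used for the left and right context can be illustrated with a short Python sketch. It assumes a single-token node and a pre-tokenized text; `context_positions` is a hypothetical helper and the example sentence is invented.

```python
def context_positions(tokens, node_index, span=3):
    """Label the tokens around tokens[node_index] as L1..Ln and R1..Rn."""
    labelled = {}
    for distance in range(1, span + 1):
        left = node_index - distance
        right = node_index + distance
        if left >= 0:                  # stop at the start of the text
            labelled[f"L{distance}"] = tokens[left]
        if right < len(tokens):        # stop at the end of the text
            labelled[f"R{distance}"] = tokens[right]
    return labelled


tokens = "the cat sat on the mat".split()
# The node is "sat" (index 2); look two positions to each side.
print(context_positions(tokens, node_index=2, span=2))
```

Here L1 is 'cat' and R1 is 'on', the positions immediately preceding and following the node.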
Subcorpus (pl. subcorpora) – A user-defined part of a corpus which searches can be restricted to. It can
include whole texts or parts of multiple texts. In #LancsBox X, subcorpora are defined using XML
structure.
Tagging – The process of adding linguistic information to the words in a text or corpus, automatically or
semi-automatically. See Part-of-speech tagging.
XML – An abbreviation for Extensible Markup Language. A machine-readable way of writing information
in text files that gives structure and annotation to the information. In corpora, XML can annotate words
with part-of-speech information and give structure to texts, for example with sections and paragraphs.
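A minimal Python sketch of both uses of XML follows: annotation of words and structuring of a text into paragraphs. The element names (`text`, `p`, `w`) and attributes (`n`, `pos`) are hypothetical and chosen only for illustration; they are not the #LancsBox X corpus format.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <text>, <p>, <w> and the "n"/"pos" attributes are
# illustrative only and are not the #LancsBox X corpus format.
SAMPLE = """
<text id="t1">
  <p n="1"><w pos="DT">The</w> <w pos="NN">corpus</w> <w pos="VBZ">grows</w></p>
  <p n="2"><w pos="NNS">Texts</w> <w pos="VBP">vary</w></p>
</text>
"""


def tagged_words(element):
    """Return (word, pos) pairs for every <w> element below `element`."""
    return [(w.text, w.get("pos")) for w in element.iter("w")]


def paragraph(root, number):
    """Return the <p> element whose n attribute equals `number`, or None."""
    for p in root.iter("p"):
        if p.get("n") == str(number):
            return p
    return None


root = ET.fromstring(SAMPLE.strip())
print(tagged_words(root))                # annotation: every word with its tag
print(tagged_words(paragraph(root, 2)))  # structure: words of one paragraph only
```

Restricting the output to one paragraph is loosely analogous to restricting a search to a part of a corpus by its XML structure.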
Developed @ Lancaster University