Armadillo Data Extraction Across Multiple Text Datasets for Arts and Humanities Research

Mark Greengrass, University of Sheffield

Response to the RePAH questionnaire (2005-6), aggregate of all Arts and Humanities respondents (RePAH: A User Requirements Analysis Report (2006), p. 102).

Some Distinctive Features of Historians' Approach to their Evidence
- Promiscuous range of sources consulted
- Firm distinction between primary and secondary sources
- Complex dialogue between existing historiography and constitutive source materials
- Reiterative process of open interrogation of source materials
- A coherent narrative consists of one composed (generally) from more than one

Historians' Database Challenge
- Growing number of (mainly text-based) historical datasets in electronic media, furnished from a wide variety of providers
- These datasets utilise a variety of different historical sources
- They contain varying amounts of encoded information (dependent on the historical question being asked by the PI, and on the constraints of the particular source being used)
- The information is encoded in different ways
- The delivery formats used also vary widely

Sources
The Marine Society Registers; Prerogative Court of Canterbury Wills; St. Martin's Settlement Exams Index; WESTCAT; Metropolitan London in the 1690s; IHR; The Westminster Historical Database; The Proceedings of the Old Bailey; Eighteenth Century Fire Insurance Policies; Collage image database; Guildhall Library; Harben's Dictionary of London; John Strype's Survey; Selected Criminal Records; TNA; http://www.motco.com; House of Lords Journals; BOPCRIS

15 July 2007

The Old Bailey Proceedings: XML

William Mawn was Tryed for stealing a Bay Gelding price 20 l. from one Thomas Lane out of Berkshire on the 25th of April. The Witness swore that the Horse was found in the Prisoner's custody in Smithfield, which the Prosecutor owned to be his. The Prisoner could not produce any Evidence to prove that he came honestly by the Horse only produc'd a Felonious person, that was no stranger to Newgate, who went under the Notion of his Man, he declared that the Prisoner bought the Horse upon the Road beyond Uxbridge. The Prisoners being found in several faultering stories, he was found Guilty.

[Death. See summary.]
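The value of an XML-tagged source like this one is that names, offences and verdicts can be read directly from the markup. The following is a minimal sketch of that idea. The element and attribute names (`trial`, `person`, `role`, `offence`, `place`, `verdict`, `punishment`) are assumptions made for illustration and are not the Old Bailey project's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element and attribute names are illustrative
# assumptions, not the schema actually used for the Proceedings.
TRIAL_XML = """
<trial>
  <person role="defendant">William Mawn</person> was Tryed for
  <offence>stealing a Bay Gelding</offence> price 20 l. from one
  <person role="victim">Thomas Lane</person> out of
  <place>Berkshire</place>. He was found
  <verdict>Guilty</verdict>.
  <punishment>Death</punishment>
</trial>
"""


def extract_fields(xml_text):
    """Pull the tagged entities out of one trial account."""
    root = ET.fromstring(xml_text)
    # Index the tagged persons by their (assumed) role attribute.
    people = {p.get("role"): p.text for p in root.iter("person")}
    return {
        "defendant": people.get("defendant"),
        "victim": people.get("victim"),
        "offence": root.findtext("offence"),
        "place": root.findtext("place"),
        "verdict": root.findtext("verdict"),
        "punishment": root.findtext("punishment"),
    }


record = extract_fields(TRIAL_XML)
print(record)
```

Because the entities are already delimited by tags, no guessing about where a name starts or ends is needed.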

15 July 2007 (c) M.Greengrass

Canterbury Wills: Delimited Text

2530553 | W | Agnes Kervill or Kervytt
2530553 | W | Andrew Bridham London
2530553 | W | Andrew London
2530553 | W | Austin Hawkyns
2530553 | W | Cecilia Foster
2530553 | W | Christian Chepman
2530553 | W | Christian Cust
2530553 | W | David Syadine Bristol,
2530553 | W | Edmund Bybbesworth
2530553 | W | Edward Wellys Hadley,
2530553 | W | Ellen Lacy Widow Saint Pe
2530553 | W | Gerard Heshull
2530553 | W | Guy Shuldham
2530553 | W | Helmingus Leget
2530553 | W | Henry Porter
2530553 | W | Henry Warlegh Keynesha
2530553 | W | Henry Wellis
2530553 | W | Hugh Caundyssh
2530553 | W | Hugh Geynesburgh Rector
2530553 | W | Isabelle Woodhill
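Delimited text carries its structure in field positions rather than in tags. As a rough sketch, such rows could be read as follows. The `|` delimiter and the field names `id`, `type` and `description` are assumptions, since the slide does not say which delimiter or column headings the original file used.

```python
import csv
import io

# Assumption: each row is "record id | record type | description".
SAMPLE = (
    "2530553 | W | Agnes Kervill or Kervytt\n"
    "2530553 | W | Andrew Bridham London\n"
)


def read_wills(text, delimiter="|"):
    """Parse delimited rows into dictionaries, trimming stray whitespace."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        rec_id, rec_type, description = (cell.strip() for cell in row[:3])
        rows.append({"id": rec_id, "type": rec_type, "description": description})
    return rows


wills = read_wills(SAMPLE)
print(wills)
```

Unlike the tagged trial text, the description field here still mixes a name with a place or status, so it needs further interpretation.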

The Issues
Can the technologies developed for the semantic web help us:
- To structure the (different) encoded information across varying sources in a way that the user community will find (research) fruitful?
- To understand the way in which these different sources relate to one another, such that they can be used in an intelligent fashion?
- To bootstrap relevant historical/semantic information from one source, by using another?

Data Sharing and Data Reuse
- Reuse means to build new applications, assembling components already built
- Sharing is when different applications use the same resources
Oscar Corcho (with acknowledgement)

Ontologies: Describe domain knowledge in a generic way and provide agreed understanding of a domain
Problem Solving Methods: Describe the reasoning process of a dataset (Knowledge-Based System) in a domain-independent manner
Interaction Problem: Representing knowledge for the purpose of solving some problem is strongly affected by the nature of the problem and the inference strategy to be applied to the problem
Bylander, Chandrasekaran, B. Generic Tasks in knowledge-based reasoning: the right level of abstraction for knowledge acquisition. In: B.R. Gaines and J.H. Boose (eds) Knowledge Acquisition for Knowledge Based Systems, 65-77, London: Academic Press 1988.
15 July 2007 (c) O. Corcho (with acknowledgement)

Definitions of an Ontology
1. An ontology defines the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary
Neches R, Fikes RE, Finin T, Gruber TR, Senator T, Swartout WR (1991) Enabling technology for knowledge sharing. AI Magazine 12(3):36-56
2. An ontology is an explicit specification of a conceptualization
Gruber TR (1993a) A translation approach to portable ontology specification. Knowledge Acquisition 5(2):199-220
3. An ontology is a formal, explicit specification of a shared conceptualization
Studer R, Benjamins VR, Fensel D (1998) Knowledge Engineering: Principles and Methods. IEEE Transactions on Data and Knowledge Engineering 25(1-2):161-197
4. A logical theory which gives an explicit, partial account of a conceptualization
Guarino N, Giaretta P (1995) Ontologies and Knowledge Bases: Towards a Terminological Clarification. In: Mars N (ed) Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing (KBKS95). University of Twente, Enschede, The Netherlands. IOS Press, Amsterdam, The Netherlands, pp 25-32
5. A set of logical axioms designed to account for the intended meaning of a vocabulary
Guarino N (1998) Formal Ontology in Information Systems. In: Guarino N (ed) 1st International Conference on

Key Components of an Ontology
- Concepts are organized in taxonomies
- Relations R: C1 x C2 x ... x Cn-1 x Cn
  Subclass-of: Concept1 x Concept2
  Connected to: Component1 x Component2
- Functions F: C1 x C2 x ... x Cn-1 --> Cn
  Mother-of: Person --> Women
  Price of a used car: Model x Year x Kilometers --> Price
- Instances: Elements
- Axioms: Sentences which are always true
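To make these components concrete, here is a small sketch of how a taxonomy, a function, instances and an axiom might be held in code. The `Ontology` class, its method names and the individuals `alice` and `bob` are all hypothetical. Only the Subclass-of relation and the Mother-of function come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # Relation Subclass-of: Concept1 x Concept2, stored as child -> parent.
    subclass_of: dict = field(default_factory=dict)
    # Instances: element -> the concept it instantiates.
    instances: dict = field(default_factory=dict)

    def ancestors(self, concept):
        """Walk up the taxonomy from a concept to its root."""
        found = []
        while concept in self.subclass_of:
            concept = self.subclass_of[concept]
            found.append(concept)
        return found

    def is_a(self, element, concept):
        """True if the element instantiates the concept or one of its subclasses."""
        own = self.instances[element]
        return own == concept or concept in self.ancestors(own)


onto = Ontology()
onto.subclass_of["Women"] = "Person"   # concepts organized in a taxonomy
onto.instances["alice"] = "Women"      # hypothetical individuals
onto.instances["bob"] = "Person"

# Function Mother-of: Person --> Women, stored as child -> mother.
mother_of = {"bob": "alice"}


def axiom_mothers_are_women(ontology, mothers):
    """Axiom (always true): every value of Mother-of is an instance of Women."""
    return all(ontology.is_a(m, "Women") for m in mothers.values())


print(onto.is_a("alice", "Person"), axiom_mothers_are_women(onto, mother_of))
```

An axiom expressed this way doubles as a consistency check that can be run whenever new instances are added.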

Semantic Continuum and Formality
- Implicit: shared human consensus (e.g. language)
- Informal [explicit]: text descriptions (e.g. dictionaries)
- Formal (for humans): semantics hardwired; used at runtime (e.g. library catalogues)
- Formal [for machines]: semantics processed and used at runtime (e.g. see below)

http://www.vicodi.org

Web-based secondary historical writing
Primary sources (historical documents; images; artefacts) in electronic

- top-down ontologies (generated from discipline-accepted taxonomies)
- middle-out ontologies (generated by intelligent iteration)
- bottom-up ontologies (generated from a representative sample of canonical data)

John Wilkins, An Essay towards a Real Character and a Philosophical Language (1668)


Armadillo: a Semantic Agent
- Retrieves information according to pre-agreed ontologies
- Takes account of deviations in spelling, typographic formatting and contextual information
- Makes use of delimited fields and tagged data as oracles to provide firm instantiations of elements in an ontology, which can then be applied to electronic materials that have no such structure
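The oracle idea can be illustrated with a short sketch. Names taken from a structured source act as known instances, and free text is scanned for close spelling variants of them. This is not Armadillo's actual algorithm: the functions `build_oracle` and `tag_names`, the sample sentence and the use of Python's `difflib` similarity ratio are all assumptions chosen to keep the example small.

```python
import difflib
import re


def build_oracle(structured_rows):
    """Collect known names from a structured source (the 'oracle')."""
    return sorted({row["name"] for row in structured_rows})


def tag_names(text, oracle, cutoff=0.8):
    """Return (surface form, oracle name) pairs for two-word spans that
    closely resemble a known name, tolerating spelling deviations."""
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for first, last in zip(tokens, tokens[1:]):
        candidate = f"{first} {last}"
        match = difflib.get_close_matches(candidate, oracle, n=1, cutoff=cutoff)
        if match:
            hits.append((candidate, match[0]))
    return hits


# Known instances taken from a structured dataset (two names from the wills list).
oracle = build_oracle([{"name": "Henry Wellis"}, {"name": "Edward Wellys"}])

# Hypothetical unstructured sentence with a variant spelling of a known name.
hits = tag_names("the goods of one Henry Wellys of London", oracle)
print(hits)
```

The cutoff controls the trade-off between tolerance of variant spellings and false matches. A production system would also need contextual evidence such as dates and places before treating two records as the same person.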


Automated Text-Mining, used for tagging purposes in Central Criminal Court records


Held on Monday, December 17th, 1866, and following days,

BEFORE THE RIGHT HON. THOMAS GABRIEL, LORD MAYOR of the City of London; Sir JOHN MELLOR, Knt., one of the Justices of Her Majesty's Court of Queen's Bench; WILLIAM TAYLOR COPELAND, Esq., THOMAS CHALLIS, Esq., THOMAS QUESTED FINNIS, Esq., Sir ROBERT WALTER CARDEN, Knt., and WILLIAM

Automated Text-Mining, used for tagging purposes in Central Criminal Court records with less success!

Held on Monday, July 22nd, 1912, and following days.

Before the Right Hon. Sir THOMAS BOOR CROSBY, M.D., LORD MAYOR of the said City of London; the Right Hon. Lord COLERIDGE, one of the Justices of His Majesty's High Court; Sir HENRY KNIGHT, Knight; Sir HORATIO DAVIES, K.C.M.G.; Sir JOHN POUND, Bart.; Sir GEORGE W. TRUSCOTT, Bart.; Sir CHARLES JOHNSTON, Knight; and Sir HORACE B. MARSHALL, Knight, LL.D., Aldermen of the said City;

Sir FORREST FULTON (Not identified), Knight, K.C., Recorder of the said City; Sir FK. ALBERT BOSANQUET (Not identified), K.C., Common Serjeant of the said City;
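One reason automated tagging can succeed on some headers and stumble on others is that a rule tuned to one typographic pattern breaks on small variations. The sketch below is a deliberately naive heuristic and not the tagger used on the Central Criminal Court records. The pattern `NAME` and the function `tag_judges` are hypothetical. It captures runs of capitalised words after "Sir" or "Lord" and loses the surname as soon as a single-letter initial appears.

```python
import re

# Assumption: names are printed in capitals after an honorific. Each name
# part must be at least two capital letters, optionally followed by a full stop.
NAME = re.compile(r"\b(?:Sir|Lord)\s+((?:[A-Z]{2,}\.?\s*)+)")


def tag_judges(header):
    """Return the capitalised name following each honorific in the header."""
    return [m.group(1).strip() for m in NAME.finditer(header)]


header = "Sir JOHN POUND, Bart.; Sir GEORGE W. TRUSCOTT, Bart.;"
names = tag_judges(header)
print(names)
```

Here "JOHN POUND" is captured whole, while "GEORGE W. TRUSCOTT" is cut short at the initial, so the same rule gives uneven results within a single header.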

