4. Test Data

The data used for the matching experiments, described in the following subsections, are provided by the Golden Agents project and the Getty.

4.1 Target Data

The SAA Inventory Documents database is used in this work as the target data. A number of handwritten inventory documents have already been digitised as part of the SAA’s effort to computerise its data. Because automated Handwritten Text Recognition (HTR) alone does not yield inventory transcriptions of sufficient quality, the data currently in use has been curated. Nevertheless, it still contains some irregularities. These include, among others: (1) segments that are broken sentences, (2) segments that sometimes appear out of order, and (3) the symbol ^, as in “Een ^ besgie”, which denotes a handwritten term that the HTR could not recognise.

An important note concerns the broken segments. One would expect the curated digitised documents to be free of broken segments. However, because the goal is a transcript faithful to the original (rather than one with well-formed sentence segments), the broken sentences of the original handwritten document are preserved in the curated version.

SAA Inventory documents sample
-------------------------------------------------------------------------------------------------------------------------------------------
SAA-item-htr-uri                                                                    text
-------------------------------------------------------------------------------------------------------------------------------------------
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l1    Inventaris ende Specificatie
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l2    van allen den goederen nagelaten
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l3    bij zal. Maria koerten sulcx
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l4    zij die met haer man Niclaes felt
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l5    int gemeen beseten ende metter
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l6    dood deser werelt ontruijmt
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l7    ende nagelaten heeft.
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l8    Eerstelijck den huijsraet, Imboel
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l9    klederen, kleijnodien ende Juwelen
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l10   getaxeert door Annetje hendricx en Susannetje
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l11   Anthonis gesworen Taxeersters deser stede
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l12   tot alsulcken prijse als volcht
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l14   Een wiegh met een baecker mat, ende
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l15   een matstoel, een bakermand ende
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l16   4
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l17   eenige kleijne mantjes
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l18   Een scharfbort met een doos ende eenigh
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l19   -10
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l20   withoutwerck en rommeling
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l21   Een ijsere rooster, brandijser en een
https://archief.amsterdam/inventarissen/inventaris/5075.nl.html#A16098000031r1l22   2
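Because an inventory item can be split over several consecutive segments (for example lines l14 and l15 in the sample above), one possible way to prepare the target data for matching is to generate candidate texts from short runs of consecutive segments. The following is only a minimal sketch under that assumption; the function names and the window size are hypothetical and are not part of the SAA data or of any existing tool.

```python
def candidate_windows(segments, max_len=3):
    """Yield (start_index, joined_text) for every run of 1..max_len
    consecutive segments. max_len is an illustrative choice."""
    for start in range(len(segments)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(segments):
                break
            yield start, " ".join(segments[start:end])


def has_unrecognised_term(text):
    """True when the segment contains the ^ placeholder left by the HTR."""
    return "^" in text


# Segments copied from the SAA sample, in the order in which they are listed.
segments = [
    "Een wiegh met een baecker mat, ende",
    "een matstoel, een bakermand ende",
    "4",
    "eenige kleijne mantjes",
]
windows = list(candidate_windows(segments, max_len=2))
```

Each window keeps the index of its first segment, so a match on a joined text can still be traced back to the original segment URIs.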

4.2 Source Data

In order to document the ownership of paintings mentioned in inventories, Montias and others transcribed about 1280 inventories in total. This database of transcripts (MDB) is therefore used in this work as the source data for testing purposes.

During transcription, a number of decisions were made. For example,

  • The particular way of enumerating an item such as “No. 21” in the digitised SAA is simply discarded by Montias.

  • Montias included his own representation of the value of an item, for example … guilders is represented as “2:10:–”.

  • Abbreviations are not used consistently across the two transcripts. For example, “No. 39. L. vrouwe beeldekens” in SAA appears in MDB as “Lieve-Vrouwe-beeldekens 1:10:–”, where the abbreviation is expanded and the words are joined into one by hyphens.

  • A number of spelling variations are also observed. For example, the following terms “Drij, water hontjes, prentebortjes, van de” in SAA are transcribed in MDB as “Dry , waterhontjes, print borts, vande”

  • Other observations concern modifications and repetitions. For example, in the SAA, “5 alabaster bortjes. Een affneming van ‘t kruys” becomes “Vier Albastarde bortjes met vergulde lijstjes. Een affneming vant kruijs” in MDB. Observe that the number 5 has been mistakenly reported as the word vier (four) instead of vijf (five).

Additionally, the single item “5 alabaster bortjes” in SAA appears five times in MDB, sometimes with a mistake: “Vier Albastarde bortjes met vergulde lijstjes.”
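Some of the transcription decisions listed above can be neutralised before matching. The sketch below is a hedged illustration only: the two regular expressions are assumptions derived solely from the examples quoted in this section (“No. 21”, “No. 39.”, “2:10:–”), and the function names are hypothetical. It does not expand abbreviations such as “L.”.

```python
import re

# Assumed patterns, based only on the examples quoted in this section.
ENUMERATION = re.compile(r"^\s*No\.\s*\d+\.?\s*")                  # "No. 21", "No. 39."
VALUE = re.compile(r"\s*\d+:(?:\d+|[–-]+):(?:\d+|[–-]+)\s*$")      # "2:10:–"


def normalise(text):
    """Drop the SAA enumeration and the MDB value, lowercase the text,
    and turn hyphens and runs of whitespace into single spaces."""
    text = ENUMERATION.sub("", text)
    text = VALUE.sub("", text)
    text = text.lower()
    text = re.sub(r"[-\s]+", " ", text)
    return text.strip()


def compact(text):
    """Normalised text without spaces, so that split and combined
    spellings of the same words compare equal."""
    return normalise(text).replace(" ", "")


saa = normalise("No. 39. L. vrouwe beeldekens")
mdb = normalise("Lieve-Vrouwe-beeldekens 1:10:–")
```

With these assumptions, `saa` and `mdb` still differ in the unexpanded abbreviation, while `compact("water hontjes")` and `compact("waterhontjes")` become identical.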

Montias data transcript sample.
---------------------------------------------------------------------------------------------------------
Getty-Frick-item-uri                                        transcription
---------------------------------------------------------------------------------------------------------
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0001  Een schilderij van de Samaritaen
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0002  Een stuck schilderij van de liefde
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0003  De salvingh Christi
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0004  Een zeevaertje
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0005  Een landschap met een vergulde lijst
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0006  Een landschap
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0007  Een gerecht
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0008  Eenich geselschap van boeren
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0009  Een landschap
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0010  Een landschap
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0011  De liefde
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0012  Een landschapje
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0013  Mede een landschap
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0014  Eenich geberchte
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0015  Salvator
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0016  Een oud besgie
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0017  Een stuckje, gemaeckt bij Potuyl
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0018  De (N.B. verbeterd uit: Een) geboorte
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0019  Noch een stuckje, van Potuyl
http://goldenagents.org/uva/SAA/Inventory/Item/N-2097_0020  Een munnick ende een bagijn

4.3 Foreseen matching issues

All the aforementioned irregularities add further difficulty to the already hard task of segment matching. Below, we categorise these foreseen problems as major or minor issues. At the moment, the main issues are broken sentences and ties. Ties occur when the same item occurs several times in the target data.

Major

  • Broken sentences
  • Abbreviations
  • Combined words
  • Numbers to words
  • Terms not recognised by HTR
  • Repetitions (same query posed more than once)
  • Ties (multiple answers for the same query)

Minor

  • Spelling variations
  • Small mistakes
  • Normalisation
  • Data cleaning: remove terms that are added or disregarded by Montias.
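Minor issues such as spelling variations can be tolerated by an approximate string comparison. As an illustration only, and not as the matching method used in this work, the sketch below ranks candidate terms with `difflib.SequenceMatcher` from the Python standard library; the function names `similarity` and `rank` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank(query, candidates):
    """Return the candidates ordered from most to least similar to the query."""
    return sorted(candidates, key=lambda c: similarity(query, c), reverse=True)


# Terms taken from the SAA / MDB examples in Section 4.2.
saa_terms = ["Drij", "water hontjes", "prentebortjes", "van de"]
best = rank("print borts", saa_terms)[0]
```

A ratio of this kind copes with small character-level differences, but not with the major issues listed above, such as broken sentences or numbers written as words.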

Repetitions & Ties

Duplicates in the target data appear for good reasons. A named item (“Een landschap”) can occur more than once in the same house (same inventory) or in multiple houses (multiple inventories). In the source data, the same phenomenon can occur for the same reason (correct transcript) or due to subjective interventions, for example reporting a segment five times although it appears only once in the original document, where it denotes a set. Ideally, we would like to remove those subjective duplicates so that we do not search for the same target several times. Observe that evaluating the same question/answer pair several times may result in a misleading evaluation of the algorithm.
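Even when subjective duplicates cannot be removed, repetitions in the source can at least be made visible before evaluation. The following minimal sketch counts repeated transcriptions; it assumes that repetition is judged on the lowercased, stripped text, and the function name is hypothetical. It cannot tell a correct repetition from a subjective one.

```python
from collections import Counter


def repetition_report(transcriptions):
    """Map each transcription that occurs more than once to its count."""
    counts = Counter(t.strip().lower() for t in transcriptions)
    return {text: n for text, n in counts.items() if n > 1}


# A subset of the MDB sample shown in Section 4.2.
mdb_sample = [
    "Een landschap met een vergulde lijst",
    "Een landschap",
    "Een gerecht",
    "Een landschap",
    "Een landschap",
    "Een landschapje",
    "Mede een landschap",
]
report = repetition_report(mdb_sample)
```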

These interventions make it difficult to remove duplicates in the source, as there is little or no information on which to base the inference of VALID versus INVALID duplicates. So, although it occurs three times in the source (A1:Een landschap, A2:Een landschap, A3:Een landschap), A2:Een landschap cannot be linked to a randomly selected exact match from the set of identical segments in the target (B1:Een landschap, B2:Een landschap, B3:Een landschap). In other words, for the sake of example, A2:Een landschap can only be matched to B2:Een landschap. This illustration indicates that, in practice, for a given source transcript there is only one correct answer in the target data.
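The tie situation in this illustration can be detected mechanically, even though it cannot be resolved from the text alone. The sketch below is a hypothetical helper that returns every target identifier whose text equals the query, so that a tie is reported rather than broken at random; the identifiers B1 to B4 are illustrative.

```python
def exact_candidates(query, target):
    """Return the ids of all target segments whose text equals the query.

    target maps a segment id to its text. More than one id means a tie:
    only one of them is the correct answer, and the text alone does not
    say which.
    """
    return [segment_id for segment_id, text in target.items() if text == query]


target = {
    "B1": "Een landschap",
    "B2": "Een landschap",
    "B3": "Een landschap",
    "B4": "Een zeevaertje",
}
candidates = exact_candidates("Een landschap", target)
is_tie = len(candidates) > 1
```

Resolving such a tie would need information beyond the segment text, for example the inventory to which each segment belongs, which this sketch does not attempt to model.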