I have recently worked on a one-year pilot for Linked Data at the University of Oxford. During this pilot I kept coming up to database design decisions which made publishing Linked Data difficult. This page includes some points to consider when designing your database:
Basic guide for database design
To assist work for Linked Data, when designing a database, please observe the following:
Summarise data
Do not use a different file for each record. It is easier to process data automatically if the records are all in the same table. If you want to present or print the data on a per-record basis, then consider a template to "read" records from the table and produce nice-looking pages. To build Linked Data the summarised table is by far more useful.
Avoid free-text
Instead of:
Manuscript with shelfmark MS-Iliad carrying textby Homer.
It should be:
Field | Data |
---|---|
shelfmark | MS-Iliad |
author | Homer |
Why? It is difficult for software to process free text, remove the syntax and identify the entities we are talking about (i.e. MS-Iliad and Homer). It is much easier to identify these if there is no syntax.
Keep information separate
Avoid bundling together different entities. For example instead of a record being:
Dimension |
---|
height: 20, width: 10, thickness: 5 |
It should be:
Dimension | Value |
---|---|
height | 20 |
width | 10 |
thickness | 5 |
Why? In Linked Data, each entity needs to stand on its own. Splitting a bundled field programmatically is difficult as often there are no consistent formulas that fields are bundled up.
Do not merge cells or use line breaks
When using spreadsheets to produce records do not use the merge cells function.
Instead of:
Value | Unit | ||
---|---|---|---|
Height | 20 | cm | |
Width | 10 | cm | |
Thickness | Max thickness | 8 | cm |
Min thickness | 5 | cm |
It should be:
Dimension | Value | Unit |
---|---|---|
height | 20 | cm |
width | 10 | cm |
min thickness | 5 | cm |
max thickness | 8 | cm |
Similarly do not use linefeeds within cells to indicate multiple records. Use instead a delimeter like |
which is much easier to process.
Why? It is much easier to "read" the data if it is all in a canonical table on a row-by-row basis. Merged cells and linefeeds break that canonical structure or confuse the rows.
Use identifiers
Give identifiers to entities contributing to a record. For example, instead of:
Shelfmark | Author |
---|---|
MS-Iliad | Homer |
It should be:
Manuscript ID | Shelfmark | Author ID | Author |
---|---|---|---|
1234 | MS-Iliad | 5678 | Homer |
Why? Not having an identifier means that the included entity (e.g. Homer) is "hidden" in text and cannot be matched to other occurences across the database. Some institutions choose to produce UUIDs as identifiers for each entity.
Note that if there are multiple authors either a new table would be neccessary or multiple rows of MS-Iliad would be required, each with a different author. This indicates the requirement for a so-called one-to-many relationship across entities which is difficult to replicate on a single spreadsheet.
Use external authorities
Allow space for external identifiers of entities. Instead of:
Author ID | Author name |
---|---|
5678 | Homer |
It should be:
Author ID | Author name | External Authority | External ID |
---|---|---|---|
5678 | Homer | VIAF | 224924963 |
5678 | Homer | WikiData | Q6691 |
Why? Linked data depend on establishing links with other datasets. This process is known as reconciliation or disambiguation. For example 5678
"Homer" is the same person as the one described in VIAF: https://viaf.org/viaf/224924963. It is useful for a database to be able to store external identifiers for entities (even external labels) to enable this linking. There are many authority files and thesauri which publish identifiers for their records. Make sure that when you are building your records you can capture these.
Reference images
Do not insert image files in your records. Only add the location as a reference to where the image can be seen. Preferably this should be a URL but in theory local paths are equally useful.