Channel: Athanasios Velios's blog

Basic guide for database design


I recently worked on a one-year pilot for Linked Data at the University of Oxford. During this pilot I kept running into database design decisions that made publishing Linked Data difficult. This page lists some points to consider when designing your database:

To make later Linked Data work easier, please observe the following when designing a database:

Summarise data

Do not use a different file for each record. It is easier to process data automatically if the records are all in the same table. If you want to present or print the data on a per-record basis, consider a template that "reads" records from the table and produces nice-looking pages. For building Linked Data, the summarised table is far more useful.
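As a minimal sketch of the template idea, the Python below reads every record from one summary table and renders a page per record. The table contents, the second shelfmark, the page layout and the name render_pages are all hypothetical and only serve as illustration.

```python
import csv
import io
from string import Template

# Hypothetical summary table: one row per record, all in a single file.
TABLE = """shelfmark,author
MS-Iliad,Homer
MS-Odyssey,Homer
"""

# Hypothetical page layout; $shelfmark and $author are filled from each row.
PAGE = Template("Manuscript $shelfmark\nAuthor: $author\n")


def render_pages(table_text):
    """Return one rendered page per row of the summary table."""
    rows = csv.DictReader(io.StringIO(table_text))
    return [PAGE.substitute(row) for row in rows]


pages = render_pages(TABLE)
for page in pages:
    print(page)
```

The data stays in one table, and the per-record pages are generated from it whenever they are needed.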

Avoid free-text

Instead of:

Manuscript with shelfmark MS-Iliad carrying text by Homer.

It should be:

Field      Data
shelfmark  MS-Iliad
author     Homer

Why? It is difficult for software to process free text, remove the syntax and identify the entities we are talking about (i.e. MS-Iliad and Homer). It is much easier to identify these if there is no syntax.

Keep information separate

Avoid bundling together different entities. For example instead of a record being:

Dimension
height: 20, width: 10, thickness: 5

It should be:

Dimension  Value
height     20
width      10
thickness  5

Why? In Linked Data, each entity needs to stand on its own. Splitting a bundled field programmatically is difficult because fields are often bundled without any consistent pattern.
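To illustrate why bundled fields are fragile, here is a small sketch of a splitter written for the exact "name: value, name: value" pattern in the example above. The function name split_dimensions is hypothetical, and the code assumes integer values.

```python
def split_dimensions(bundled):
    """Split a 'name: value, name: value' string into (dimension, value) rows.

    This only works for the one pattern it was written for.
    """
    rows = []
    for part in bundled.split(","):
        name, _, value = part.partition(":")
        # int() raises ValueError if the value is missing or not a whole number.
        rows.append((name.strip(), int(value.strip())))
    return rows


rows = split_dimensions("height: 20, width: 10, thickness: 5")
print(rows)
```

A record bundled in a different way, for instance "20 x 10 x 5", makes this function raise a ValueError, and each new variation needs its own special case. Storing each dimension in its own row avoids the problem altogether.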

Do not merge cells or use line breaks

When using spreadsheets to produce records do not use the merge cells function.

Instead of:

                          Value  Unit
Height                    20     cm
Width                     10     cm
Thickness  Max thickness  8      cm
           Min thickness  5      cm

It should be:

Dimension      Value  Unit
height         20     cm
width          10     cm
min thickness  5      cm
max thickness  8      cm

Similarly, do not use linefeeds within cells to indicate multiple records. Instead, use a delimiter such as |, which is much easier to process.

Why? It is much easier to "read" the data if it is all in a canonical table on a row-by-row basis. Merged cells and linefeeds break that canonical structure or confuse the rows.
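As a rough sketch of why a delimiter is easy to process, the Python below turns a cell holding several |-separated values into one row per value. The sample table, the shelfmark MS-Example, the author names and the function name explode are hypothetical.

```python
import csv
import io

# Hypothetical table: the second record lists two authors in one cell,
# separated by "|" rather than by a line break.
TABLE = "shelfmark,author\nMS-Iliad,Homer\nMS-Example,Author A|Author B\n"


def explode(table_text, column, delimiter="|"):
    """Return one row per delimited value found in the given column."""
    rows = []
    for row in csv.DictReader(io.StringIO(table_text)):
        for value in row[column].split(delimiter):
            # Copy the row, replacing the multi-valued cell with a single value.
            rows.append({**row, column: value.strip()})
    return rows


rows = explode(TABLE, "author")
for row in rows:
    print(row["shelfmark"], row["author"])
```

Each input row stays on a single line of the file, so the row-by-row structure is preserved until the values are deliberately split.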

Use identifiers

Give identifiers to entities contributing to a record. For example, instead of:

Shelfmark  Author
MS-Iliad   Homer

It should be:

Manuscript ID  Shelfmark  Author ID  Author
1234           MS-Iliad   5678       Homer

Why? Not having an identifier means that the included entity (e.g. Homer) is "hidden" in text and cannot be matched to other occurrences across the database. Some institutions choose to produce UUIDs as identifiers for each entity.
Note that if there are multiple authors, either a new table would be necessary or multiple rows for MS-Iliad would be required, each with a different author. This indicates the need for a so-called one-to-many relationship between entities, which is difficult to replicate in a single spreadsheet.
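One possible way to hold several authors per manuscript is a separate link table. The sketch below uses Python's built-in sqlite3 module. The table and column names are assumptions, and the second author (ID 9999) is invented purely to show a manuscript with more than one author.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE manuscript (id INTEGER PRIMARY KEY, shelfmark TEXT NOT NULL);
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Link table: one row per manuscript/author pair.
CREATE TABLE manuscript_author (
    manuscript_id INTEGER NOT NULL REFERENCES manuscript(id),
    author_id INTEGER NOT NULL REFERENCES author(id),
    PRIMARY KEY (manuscript_id, author_id)
);
""")

conn.execute("INSERT INTO manuscript VALUES (1234, 'MS-Iliad')")
conn.executemany(
    "INSERT INTO author VALUES (?, ?)",
    [(5678, "Homer"), (9999, "Hypothetical co-author")],
)
conn.executemany(
    "INSERT INTO manuscript_author VALUES (?, ?)",
    [(1234, 5678), (1234, 9999)],
)

# List every author linked to manuscript 1234.
query = """
SELECT author.name
FROM author
JOIN manuscript_author ON author.id = manuscript_author.author_id
WHERE manuscript_author.manuscript_id = ?
ORDER BY author.id
"""
authors = [name for (name,) in conn.execute(query, (1234,))]
print(authors)
```

Each manuscript and each author is stored once under its own identifier, and the link table can hold as many pairs as needed.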

Use external authorities

Allow space for external identifiers of entities. Instead of:

Author ID  Author name
5678       Homer

It should be:

Author ID  Author name  External Authority  External ID
5678       Homer        VIAF                224924963
5678       Homer        WikiData            Q6691

Why? Linked Data depends on establishing links with other datasets. This process is known as reconciliation or disambiguation. For example, author 5678 ("Homer") is the same person as the one described in VIAF: https://viaf.org/viaf/224924963. It is useful for a database to be able to store external identifiers for entities (and even external labels) to enable this linking. There are many authority files and thesauri which publish identifiers for their records. Make sure that you can capture these when you are building your records.
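Once external identifiers are stored in their own columns, turning them into links is straightforward. The sketch below is only an illustration: the VIAF URL pattern is the one quoted above, while the Wikidata pattern and the names URL_PATTERNS and external_links are assumptions.

```python
# Rows as in the table above: author ID, name, external authority, external ID.
EXTERNAL_IDS = [
    (5678, "Homer", "VIAF", "224924963"),
    (5678, "Homer", "WikiData", "Q6691"),
]

# URL pattern per authority. The VIAF pattern follows the link quoted in the
# text; the Wikidata pattern is an assumption.
URL_PATTERNS = {
    "VIAF": "https://viaf.org/viaf/{id}",
    "WikiData": "https://www.wikidata.org/wiki/{id}",
}


def external_links(rows):
    """Build one URL per (authority, external ID) pair."""
    return [
        URL_PATTERNS[authority].format(id=external_id)
        for _author_id, _name, authority, external_id in rows
    ]


links = external_links(EXTERNAL_IDS)
for link in links:
    print(link)
```

Because the authority and the identifier are kept in separate fields, adding another authority only means adding another row and another pattern.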

Reference images

Do not insert image files in your records. Only add the location as a reference to where the image can be seen. Preferably this should be a URL but in theory local paths are equally useful.

