This post focuses on datasets in historical research: their creation, applications, and limits. As I am not a historian, this text will be more technical and ... well, brief :).
Quantitative Historical Research
Quantitative historical research is a branch of historical science that describes history with the help of statistics and computer science. The crucial prerequisite for such research is an input dataset that can subsequently be analyzed and used to answer questions about, or understand, a certain topic or problem. Quantitative historical research has numerous applications, e.g. historical demography or economic history.
Sources of Historical Data
Historians work with various historical sources. First, there are primary sources like archaeological excavations or manuscripts of diaries. Then there are secondary (or tertiary) sources like research papers, encyclopedias, maps and so on. All of these sources can be used in quantitative historical research, as they contain certain data. But the extraction process can be specific to each source type, for example:
- a manuscript often contains unstructured text that has to be coded and the relevant information extracted manually
- the outcome of a scientific paper could be a computer-readable table or file that is possible to just "save and use"
- maps in raster format have to be georeferenced and the information extracted in a GIS
- archaeologists store their excavations in structured lists/databases from which relevant information can be downloaded and extracted
The Life Stages of Historical Data
A historical dataset goes through several life stages. At point zero there is the hard work of domain experts, i.e. historians, but the first crucial part of any dataset creation is data consideration: there are no data without a "guided" observation and a collection of those observations. Data are not innocent information; they are constructed with a problem or research question in mind, and this question guides the choice of sources for the dataset. Then follows the uneasy process of data preparation/extraction/transformation, which in most cases in the GEHIR project still meant a lot of manual labor, with the human brain as the bridging interface between books and a spreadsheet. As any data scientist knows, data preparation is usually the most time-consuming work. When the dataset is ready, an analysis of the research hypothesis follows. The analyzed and processed dataset can also be used to popularize the topic in the form of a webpage, map, visualization, etc. Adhering to the principles of open science, the last phase should be dataset publication.
Let's talk more about dataset structure.
Dataset Variable Domains and Types
The majority of datasets are matrices, with columns identifying attributes and rows representing cases. Before data extraction, the structure of the dataset has to be defined. Every attribute (column) has to have a domain, i.e. a set of possible values (nominal, ordinal, quantitative, binary...). Often, the dataset contains columns of a more specific type:
- a date can have different granularity (the year 1452, the 2nd century, 11.09 14:57...) and a specific domain type (it is neither plain text nor a number), so extraction and analysis can be a little tricky (see Table 1 for an example of handling date data)
- geographical coordinates are in most cases given as latitude and longitude (in degrees) that can be stored as numbers. While this looks like standard quantitative data, it requires specific treatment, because degrees are spherical, not planar, coordinates
- an image is sometimes very interesting data from the perspective of storing and analyzing. Nowadays, computer vision and image analysis offer methods that can be used in historical research (e.g. automatic analysis of a given artifact and extraction of specific data)
- a hyperlink is a way to refer to a specific place in a different dataset, which makes it possible to extract or work with more information without filling the dataset with duplicates
- long free-form descriptions might help in further qualitative research. Such a column can also store interesting information that does not fit the defined structure.
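To illustrate why coordinates in degrees need specific treatment, here is a minimal sketch of a great-circle (haversine) distance calculation; the coordinates for Rome and Alexandria are rounded, illustrative values:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points (degrees), in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Rome to Alexandria (approximate coordinates) -- treating degrees as
# planar x/y and using Euclidean distance would give a wrong answer.
print(round(haversine_km(41.9, 12.5, 31.2, 29.9)), "km")
```

This is exactly the kind of operation that GIS software performs under the hood when measuring distances between point features.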
From the perspective of storing data, we nowadays have various possibilities:
- a database (MySQL, PostgreSQL, NoSQL...) is the most robust and technically demanding solution, used mainly for more complex datasets decomposed into mutually related tables
- a table (xls, csv, md...) is very common, as we have a broad selection of tools and the outcome is easily transferable, sharable and readable
- GIS formats (GeoJSON, ESRI Shapefile, KML...) allow easier handling of geographical data within GIS software (QGIS, ArcGIS). It is also possible to store geographic information in a table or a database, but GIS formats often support more topology types (lines, polygons...) and provide better possibilities for geographic analysis
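The gap between the table and GIS worlds is small for point data. As a sketch (using only the Python standard library; the column names `x` and `y` are assumptions, not a fixed convention), a CSV of points can be converted to a GeoJSON FeatureCollection like this:

```python
import csv
import io
import json

def csv_points_to_geojson(csv_text, lon_col="x", lat_col="y"):
    """Convert a CSV of point records to a GeoJSON FeatureCollection.

    Note that GeoJSON stores coordinates in [longitude, latitude] order.
    """
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lon = float(row.pop(lon_col))
        lat = float(row.pop(lat_col))
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": row,  # all remaining columns become attributes
        })
    return {"type": "FeatureCollection", "features": features}

sample = "name,x,y\nRome,12.5,41.9\nAlexandria,29.9,31.2\n"
print(json.dumps(csv_points_to_geojson(sample), indent=2))
```

The resulting file opens directly in QGIS as a point layer, attributes included.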
Examples of Historical Datasets
While working on the GEHIR project we came across various collections of datasets that are ready to use in historical research. Just to mention a few of them:
- DARMC (https://darmc.harvard.edu) - The Digital Atlas of Roman and Medieval Civilizations "makes freely available on the internet the best available materials for a Geographic Information Systems (GIS) approach to mapping and spatial analysis of the Roman and medieval worlds"
- AWMC (http://awmc.unc.edu/wordpress/, https://github.com/AWMC/geodata) - Ancient World Mapping center that "promotes cartography, historical geography, and geographic information science as essential disciplines within the field of ancient studies through innovative and collaborative research, teaching, and community outreach activities"
- Pelagios (http://commons.pelagios.org/) is more a gazetteer that connects data than a collection of datasets itself.
And finally, the most interesting part: for the purposes of the GEHIR project we coded a few datasets that we would like to share this way. All of our data are geocoded and contain various attributes. The CSV format was chosen as it is arguably the easiest to read and can be opened in most tools; moreover, our geographic topologies are limited to points (stored as x and y coordinates), so this format is sufficient.
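The "save and use" promise of CSV can be shown with a few lines of standard-library Python. The fragment below is shaped like one of our geocoded tables, but the column names and values are invented for illustration:

```python
import csv
import io

# An illustrative fragment of a geocoded dataset; column names
# and values are made up for this example.
data = """name,x,y,date_post_quem,date_ante_quem
Church A,12.5,41.9,313,392
Church B,29.9,31.2,330,366
"""

rows = list(csv.DictReader(io.StringIO(data)))
# CSV delivers every cell as a string; coordinates and dates
# must be cast to numbers before any analysis.
fourth_century = [r["name"] for r in rows
                  if 300 <= int(r["date_post_quem"]) < 400]
print(fourth_century)  # both sample records fall in the 4th century
```

No special tooling is needed; the same file also opens in a spreadsheet or loads into R or pandas.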
Early Christian Churches
Table 1: Date coding rules. Each source date expression (examples below) is coded into four columns: Date - precise, Date - post quem, Date - ante quem, Ceased to exist.

- datable to 392
- built by 366
- first attested in 353/4
- may 11, 218 b.c.e.
- march/april 52 c.e.
- the period 313-28
- 330 BC - 640 AD (330 BCE - 640 CE)
- ca. 90–100 c.e.
- second half of second century to first half of third century c.e.
- early first century b.c.e.
- perhaps of the 4th c.
- dating to the 1st half of the 4th c.
- in mid-4th c.
- late 4th c.
- end of 4th c.
- by the end of the 4th
- from the last quarter of the 4th c.
- early decades of the 4th c.
- built in about the 360s
- by the 340s
- teens of the 4th
- burnt in the 360s
- the late 4th c. or more probably the early 5th
- the last third of the 4th c., perhaps 385
- 4th c. or 5th c.?
- possibly first or second century c.e.
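The coding rules in Table 1 can be sketched as an interval representation: a precise date collapses to a single year, while "post quem" and "ante quem" bound an open-ended range. This is only an illustrative sketch; the actual GEHIR coding scheme and its values may differ:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodedDate:
    """Interval coding of an uncertain date (years; negative = BCE)."""
    post_quem: Optional[int] = None  # earliest possible year
    ante_quem: Optional[int] = None  # latest possible year

    @property
    def precise(self) -> Optional[int]:
        """A date is precise when both bounds coincide."""
        return self.post_quem if self.post_quem == self.ante_quem else None

# Hand-coded examples mirroring some rows of Table 1 (the numeric
# bounds chosen here are illustrative, not the project's coded values):
datable_to_392 = CodedDate(392, 392)       # "datable to 392" -> precise
built_by_366   = CodedDate(ante_quem=366)  # "built by 366" -> ante quem only
late_4th_c     = CodedDate(375, 400)       # "late 4th c." -> an interval
bce_to_ce      = CodedDate(-330, 640)      # "330 BCE - 640 CE"

print(datable_to_392.precise, built_by_366.ante_quem, late_4th_c.post_quem)
```

Such an interval coding keeps the uncertainty of the source expression while still allowing quantitative queries like "all sites possibly existing in year X".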