Introduction

This post will focus on datasets in historical research, their creation, applications and limits. As I am not a historian, this text will be more technical and ... well, brief :).

Quantitative Historical Research

Quantitative historical research is a branch of historical science that tries to describe history with the help of statistics or computer science. The crucial prerequisite for such research is an input dataset that can subsequently be analyzed and used to answer questions or to understand a certain topic or problem. Quantitative historical research has numerous applications, e.g. historical demography or economic history.

Sources of Historical Data

Historians work with various historical sources. First, we have primary sources like archaeological excavations, manuscripts or diaries. Then, there are secondary (or tertiary) sources like research papers, encyclopedias, maps and so on. All of these sources can be used in quantitative historical research, as they contain certain data. But the process of extracting the data can be specific to each source type, for example:

  • a manuscript often contains unstructured text that has to be coded, with the relevant information extracted manually
  • the outcome of a scientific paper could be a computer-readable table or a file that can simply be "saved and used"
  • maps in raster format have to be georeferenced and the information extracted in GIS
  • archaeologists store their excavations in a structured list or database that can be downloaded and from which the relevant information can be extracted

 

The Life Stages of Historical Data

Historical datasets pass through several life stages. Everything rests on the hard work of domain experts, that is, historians, but the first and crucial part of any dataset creation is data consideration: there are no data without "guided" observation and the collection of these observations. Data are not innocent information; they are constructed with a problem or research question in mind, and this question governs the choice of sources for the dataset. Next comes the laborious process of data preparation, extraction and transformation, which in most cases in the GEHIR project still meant a lot of manual labor, with the human brain as the bridging interface between books and spreadsheets. As any data scientist knows, data preparation is usually the most time-consuming part of the work. When the dataset is ready, the research hypothesis can be analyzed. An analyzed and processed dataset can also be used to promote the topic in the form of a webpage, map, visualization, etc. In keeping with the principles of open science, the last phase should be the publication of the dataset.

Let's talk more about dataset structure.

Dataset Variable Domains/Types

The majority of datasets are matrices with columns representing attributes and rows representing cases. Before data extraction, the structure of the dataset has to be defined. Every attribute (column) has to have a domain, i.e. a set of possible values (nominal, ordinal, quantitative, binary...). Often, a dataset also contains more specific types:

  • a date can have different granularity (the year 1452, the 2nd century, 11.09 14:57...) and a specific domain type (it is neither text nor a number), so its extraction and analysis can be a little tricky (see Table 1 for an example of handling date data)
  • geographical coordinates are in most cases defined as latitude and longitude (in degrees) that can be stored as numbers. While it looks like a standard quantitative analysis, it requires a specific
  • an image is sometimes very interesting data from the perspective of storing and analyzing. Nowadays, computer vision and image analysis offer methods that could be used in historical research (automatic analysis of a given artifact and extraction of specific data)
  • a hyperlink is a way to refer to a specific place in a different dataset. It makes it possible to extract or work with more information without filling the dataset with duplicates
  • long free-form descriptions might help in further qualitative research. Such a column can also store interesting information that does not fit into the defined structure.
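To make the idea of column domains more concrete, here is a minimal Python sketch of one dataset row with a few simple validity checks. The class name `SiteRecord` and all of its field names are hypothetical and are not taken from the GEHIR datasets; the sketch only assumes that years are stored as integers (negative for BCE) and coordinates as decimal degrees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteRecord:
    """One row (case) of a hypothetical historical dataset."""

    name: str                      # nominal value
    certainty: int                 # ordinal value, e.g. 1 (low) to 3 (high)
    excavated: bool                # binary value
    date_post_quem: Optional[int]  # earliest possible year, negative = BCE
    date_ante_quem: Optional[int]  # latest possible year, negative = BCE
    latitude: float                # decimal degrees, -90 to 90
    longitude: float               # decimal degrees, -180 to 180
    source_url: str = ""           # hyperlink to a record in another dataset
    note: str = ""                 # free-form description

    def __post_init__(self) -> None:
        # Reject values that fall outside the domain of their column.
        if not -90 <= self.latitude <= 90:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180 <= self.longitude <= 180:
            raise ValueError(f"longitude out of range: {self.longitude}")
        if (
            self.date_post_quem is not None
            and self.date_ante_quem is not None
            and self.date_post_quem > self.date_ante_quem
        ):
            raise ValueError("post quem must not be later than ante quem")


# Made-up example row, for illustration only.
record = SiteRecord("Example temple", 2, True, 301, 400, 41.9, 12.5)
print(record)
```

Checking the domain when a row is created catches typing errors early, which matters when most of the data are entered by hand.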

 

From the perspective of storing data, we nowadays have various possibilities:

  • a database (MySQL, PostgreSQL, NoSQL...) is the most robust and technically demanding solution, used mainly for more complex datasets decomposed into mutually related tables
  • a table (xls, csv, md ...) is very common, as we have a broad selection of tools and the result is easy to transfer, share and read
  • GIS formats (GeoJSON, ESRI Shapefile, KML...) allow easier handling of geographical data within GIS software (QGIS, ArcGIS). It is also possible to store geographic information within a table or a database, but GIS formats often allow more topology types (lines, polygons...) and provide better possibilities for geographic analysis
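Moving point data from a table to a GIS format is usually straightforward. The following sketch converts CSV text into a GeoJSON FeatureCollection using only the Python standard library. The function name `csv_points_to_geojson` and the sample row are made up for illustration, and the sketch assumes that the `x` column holds longitude and the `y` column latitude.

```python
import csv
import io
import json


def csv_points_to_geojson(csv_text: str, x_col: str = "x", y_col: str = "y") -> dict:
    """Convert CSV rows with point coordinates into a GeoJSON FeatureCollection."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: x is longitude and y is latitude, in decimal degrees.
        lon = float(row.pop(x_col))
        lat = float(row.pop(y_col))
        features.append({
            "type": "Feature",
            # GeoJSON stores coordinates in [longitude, latitude] order.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            # All remaining columns become attributes of the point.
            "properties": row,
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample data, for illustration only.
sample = "name,x,y\nExample site,12.5,41.9\n"
print(json.dumps(csv_points_to_geojson(sample), indent=2))
```

The resulting file can be opened directly in GIS software such as QGIS.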

 

Examples of Historical Datasets

While working on the GEHIR project, we came across a varied collection of datasets that are ready to use in historical research. Just to mention a few of them:

  • DARMC (https://darmc.harvard.edu) - The Digital Atlas of Roman and Medieval Civilizations "makes freely available on the internet the best available materials for a Geographic Information Systems (GIS) approach to mapping and spatial analysis of the Roman and medieval worlds"
  • AWMC (http://awmc.unc.edu/wordpress/, https://github.com/AWMC/geodata) - The Ancient World Mapping Center, which "promotes cartography, historical geography, and geographic information science as essential disciplines within the field of ancient studies through innovative and collaborative research, teaching, and community outreach activities"
  • Pelagios (http://commons.pelagios.org/) is more a gazetteer that connects data than a collection of datasets itself.

 

GEHIR Datasets

And finally the most interesting part - for the purposes of the GEHIR project we coded a few datasets that we would like to share here. All of our data are geocoded and contain various attributes. We chose the CSV format because it is considered the most easily readable one and can be opened in most tools; our geographical topologies are also limited to points (stored as x and y coordinates), so this format is sufficient.
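Because the points are stored as plain x and y columns, simple spatial selections need nothing more than a CSV reader. Below is a minimal sketch that keeps only the rows falling inside a bounding box. The function name `points_in_bbox` and the sample rows are made up for illustration, and the sketch assumes that the columns are literally named `x` (longitude) and `y` (latitude), which may differ in the actual files.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def points_in_bbox(
    csv_file: TextIO, west: float, south: float, east: float, north: float
) -> Iterator[Dict[str, str]]:
    """Yield the rows whose point lies inside the given bounding box."""
    for row in csv.DictReader(csv_file):
        try:
            # Assumption: x is longitude and y is latitude, in decimal degrees.
            x, y = float(row["x"]), float(row["y"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or malformed coordinates
        if west <= x <= east and south <= y <= north:
            yield row


# Made-up sample rows, for illustration only.
sample = io.StringIO("name,x,y\nSite A,12.5,41.9\nSite B,31.2,30.0\nSite C,,\n")
for row in points_in_bbox(sample, west=10, south=40, east=15, north=45):
    print(row["name"])
```

With a real file, the `io.StringIO` object would be replaced by an opened CSV file, for example `open("churches.csv", newline="", encoding="utf-8")`.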

Isiac artefacts

isis artefacts.csv


 

Isis temples

isis temples.csv

 


 

Mithraic places

mithraic places.csv


 

Jewish synagogues

synagogues.csv

 


 

Early Christian Churches

churches.csv

 


 

Table 1: Date Coding rules

Text example                                                      | Date - precise | Date - post quem | Date - ante quem | Ceased to exist
------------------------------------------------------------------+----------------+------------------+------------------+----------------
datable to 392                                                    | 392            |                  |                  |
of 392                                                            | 392            |                  |                  |
ca. 350                                                           |                | 346              | 355              |
built by 366                                                      |                | -                | 366              |
first attested in 353/4                                           |                |                  | 354              |
may 11, 218 b.c.e.                                                | -218           |                  |                  |
march/april 52 c.e.                                               | 52             |                  |                  |
the period 313-28                                                 |                | 313              | 328              |
330 BC - 640 AD (330 BCE - 640 CE)                                |                | -330             | 640              |
ca. 90–100 c.e.                                                   |                | 90               | 100              |
80–90 c.e.                                                        |                | 80               | 90               |
second half of second century to first half of third century c.e. |                | 151              | 250              |
early first century b.c.e.                                        |                | -100             | -76              |
perhaps of the 4th c.                                             |                | 301              | 400              |
dating to the 1st half of the 4th c.                              |                | 301              | 350              |
in mid-4th c.                                                     |                | 326              | 375              |
mid-fourth c.                                                     |                | 326              | 375              |
4th c.                                                            |                | 301              | 400              |
late 4th c.                                                       |                | 376              | 400              |
end of 4th c.                                                     |                | 376              | 400              |
by the end of the 4th                                             |                | -                | 400              |
from the last quarter of the 4th c.                               |                | 376              | 400              |
early decades of the 4th c                                        |                | 300              | 329              |
built in about the 360s                                           |                | 360              | 369              |
by the 340s                                                       |                | -                | 349              |
in 320s                                                           |                | 320              | 329              |
teens of the 4th                                                  |                | 310              | 319              |
burnt in the 360s                                                 |                | -                | 369              | 369
the late 4th c. or more probably the early 5th                    |                | 376              | 425              |
the last third of the 4th c., perhaps 385                         |                | 367              | 400              |
4th c. or 5th c.?                                                 |                | 301              | 500              |
possibly first or second century c.e.                             |                | 1                | 200              |
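Several of the rules in Table 1 are regular enough to be expressed in code. The following Python sketch covers only the century and decade rows; the function names, the `PARTS` table and the part labels are hypothetical helpers and not the procedure actually used in the GEHIR project, where the coding was done by hand. The year offsets are read off the table rows (for example, "mid" is 26 to 75 and "late" is 76 to 100), and BCE years are stored as negative numbers, as in the table.

```python
from typing import Tuple

# (first year, last year) of each part inside a century counted from 1 to 100,
# read off the rows of Table 1.
PARTS = {
    "whole": (1, 100),
    "first_half": (1, 50),
    "second_half": (51, 100),
    "early": (1, 25),
    "mid": (26, 75),
    "late": (76, 100),
}


def century_range(century: int, part: str = "whole", bce: bool = False) -> Tuple[int, int]:
    """Return (post quem, ante quem) years for a century or a part of it."""
    lo, hi = PARTS[part]
    if bce:
        # The 1st century BCE runs from -100 to -1, so parts are counted
        # forward in time from the most negative year.
        base = -century * 100
        return base + lo - 1, base + hi - 1
    # The 4th century CE runs from 301 to 400.
    base = (century - 1) * 100
    return base + lo, base + hi


def decade_range(decade_start: int) -> Tuple[int, int]:
    """Return (post quem, ante quem) for a decade such as 'the 360s'."""
    return decade_start, decade_start + 9


print(century_range(4))                    # 4th c.              -> (301, 400)
print(century_range(4, "mid"))             # mid-4th c.          -> (326, 375)
print(century_range(1, "early", bce=True)) # early 1st c. b.c.e. -> (-100, -76)
print(decade_range(360))                   # the 360s            -> (360, 369)
```

Phrases such as "built by 366" or "perhaps 385" still need a human reader, so a helper like this can only assist manual coding, not replace it.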

 

References

  • Anderson, Margo. Quantitative history. The Sage Handbook of Social Science Methodology, ed. William Outhwaite and Stephen Turner, London: Sage Publications, 2007.