This post focuses on datasets in historical research: their creation, applications, and limits. As I am not a historian, this text will be more technical and ... well, brief :).
Quantitative Historical Research
Quantitative historical research is a branch of historical science that describes history with the help of statistics and computer science. The crucial prerequisite for such research is an input dataset that can subsequently be analyzed and used to answer questions about, or understand, a certain topic or problem. Quantitative historical research has numerous applications, e.g. historical demography or economic history.
Sources of Historical Data
Historians work with various historical sources. First, there are primary sources like archaeological excavations or manuscripts of diaries. Then there are secondary (or tertiary) sources like research papers, encyclopedias, maps and so on. All of these sources can be used in quantitative historical research, as they contain certain data. But the extraction process can be specific to each source type, for example:
- a manuscript often contains unstructured text that has to be coded and the relevant information extracted manually
- the outcome of a scientific paper could be a computer-readable table or file that is possible to just "save and use"
- maps in raster format have to be georeferenced and the information extracted in a GIS
- archaeologists store their excavations in structured lists/databases from which relevant information can be downloaded and extracted
The Life Stages of Historical Data
A historical dataset goes through several life stages. At point zero there is the hard work of domain experts, i.e. historians, but the first crucial part of any dataset creation is data consideration: there are no data without a "guided" observation and a collection of those observations. Data are not innocent information; they are constructed with a problem or research question in mind, and this question guides the choice of sources for the dataset. Then follows the uneasy process of data preparation/extraction/transformation, which in most cases in the GEHIR project still meant a lot of manual labor, with the human brain as the bridging interface between books and a spreadsheet. As any data scientist knows, data preparation is usually the most time-consuming work. When the dataset is ready, an analysis of the research hypothesis follows. The analyzed and processed dataset can also be used to popularize the topic in the form of a webpage, map, visualization, etc. Adhering to the principles of open science, the last phase should be dataset publication.
Let's talk more about dataset structure.
Dataset Variable Domains and Types
The majority of datasets are matrices, with columns identifying attributes and rows representing cases. Before data extraction, the structure of the dataset has to be defined. Every attribute (column) has to have a domain, i.e. a set of possible values (nominal, ordinal, quantitative, binary...). Often, the dataset contains columns of a more specific type:
- a date can have different granularity (the year 1452, the 2nd century, 11.09 14:57...) and a specific domain type (it is neither plain text nor a number), so extraction and analysis can be a little tricky (see Table 1 for an example of handling date data)
- geographical coordinates are in most cases given as latitude and longitude (in degrees) that can be stored as numbers. While this looks like standard quantitative data, it requires specific treatment, because degrees are spherical, not planar, coordinates
- an image is sometimes very interesting data from the perspective of storing and analyzing. Nowadays, computer vision and image analysis offer methods that can be used in historical research (e.g. automatic analysis of a given artifact and extraction of specific data)
- a hyperlink is a way to refer to a specific place in a different dataset, which makes it possible to extract or work with more information without filling the dataset with duplicates
- long free-form descriptions might help in further qualitative research. Such a column can also store interesting information that does not fit the defined structure.
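To illustrate why coordinates in degrees need specific treatment, here is a minimal sketch of a great-circle (haversine) distance calculation; the coordinates for Rome and Alexandria are rounded, illustrative values:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points (degrees), in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Rome to Alexandria (approximate coordinates) -- treating degrees as
# planar x/y and using Euclidean distance would give a wrong answer.
print(round(haversine_km(41.9, 12.5, 31.2, 29.9)), "km")
```

This is exactly the kind of operation that GIS software performs under the hood when measuring distances between point features.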
From the perspective of storing data, we nowadays have various possibilities:
- a database (MySQL, PostgreSQL, NoSQL...) is the most robust and technically demanding solution, used mainly for more complex datasets decomposed into mutually related tables
- a table (xls, csv, md...) is very common, as we have a broad selection of tools and the outcome is easily transferable, sharable and readable
- GIS formats (GeoJSON, ESRI Shapefile, KML...) allow easier handling of geographical data within GIS software (QGIS, ArcGIS). It is also possible to store geographic information in a table or a database, but GIS formats often support more topology types (lines, polygons...) and provide better possibilities for geographic analysis
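The gap between the table and GIS worlds is small for point data. As a sketch (using only the Python standard library; the column names `x` and `y` are assumptions, not a fixed convention), a CSV of points can be converted to a GeoJSON FeatureCollection like this:

```python
import csv
import io
import json

def csv_points_to_geojson(csv_text, lon_col="x", lat_col="y"):
    """Convert a CSV of point records to a GeoJSON FeatureCollection.

    Note that GeoJSON stores coordinates in [longitude, latitude] order.
    """
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lon = float(row.pop(lon_col))
        lat = float(row.pop(lat_col))
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": row,  # all remaining columns become attributes
        })
    return {"type": "FeatureCollection", "features": features}

sample = "name,x,y\nRome,12.5,41.9\nAlexandria,29.9,31.2\n"
print(json.dumps(csv_points_to_geojson(sample), indent=2))
```

The resulting file opens directly in QGIS as a point layer, attributes included.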
Examples of Historical Datasets
While working on the GEHIR project we came across various collections of datasets that are ready to use in historical research. Just to mention a few of them:
- DARMC (https://darmc.harvard.edu) - The Digital Atlas of Roman and Medieval Civilizations "makes freely available on the internet the best available materials for a Geographic Information Systems (GIS) approach to mapping and spatial analysis of the Roman and medieval worlds"
- AWMC (http://awmc.unc.edu/wordpress/, https://github.com/AWMC/geodata) - Ancient World Mapping center that "promotes cartography, historical geography, and geographic information science as essential disciplines within the field of ancient studies through innovative and collaborative research, teaching, and community outreach activities"
- Pelagios (http://commons.pelagios.org/) is more a gazetteer that connects data than a collection of datasets itself.
And finally, the most interesting part: for the purposes of the GEHIR project we coded a few datasets that we would like to share this way. All of our data are geocoded and contain various attributes. The CSV format was chosen as it is arguably the easiest to read and can be opened in most tools; moreover, our geographic topologies are limited to points (stored as x and y coordinates), so this format is sufficient.
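The "save and use" promise of CSV can be shown with a few lines of standard-library Python. The fragment below is shaped like one of our geocoded tables, but the column names and values are invented for illustration:

```python
import csv
import io

# An illustrative fragment of a geocoded dataset; column names
# and values are made up for this example.
data = """name,x,y,date_post_quem,date_ante_quem
Church A,12.5,41.9,313,392
Church B,29.9,31.2,330,366
"""

rows = list(csv.DictReader(io.StringIO(data)))
# CSV delivers every cell as a string; coordinates and dates
# must be cast to numbers before any analysis.
fourth_century = [r["name"] for r in rows
                  if 300 <= int(r["date_post_quem"]) < 400]
print(fourth_century)  # both sample records fall in the 4th century
```

No special tooling is needed; the same file also opens in a spreadsheet or loads into R or pandas.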
Early Christian Churches
Table 1: Date coding rules. Each source date expression (examples below) is coded into four columns: Date - precise, Date - post quem, Date - ante quem, Ceased to exist.

- datable to 392
- built by 366
- first attested in 353/4
- may 11, 218 b.c.e.
- march/april 52 c.e.
- the period 313-28
- 330 BC - 640 AD (330 BCE - 640 CE)
- ca. 90–100 c.e.
- second half of second century to first half of third century c.e.
- early first century b.c.e.
- perhaps of the 4th c.
- dating to the 1st half of the 4th c.
- in mid-4th c.
- late 4th c.
- end of 4th c.
- by the end of the 4th
- from the last quarter of the 4th c.
- early decades of the 4th c.
- built in about the 360s
- by the 340s
- teens of the 4th
- burnt in the 360s
- the late 4th c. or more probably the early 5th
- the last third of the 4th c., perhaps 385
- 4th c. or 5th c.?
- possibly first or second century c.e.
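The coding rules in Table 1 can be sketched as an interval representation: a precise date collapses to a single year, while "post quem" and "ante quem" bound an open-ended range. This is only an illustrative sketch; the actual GEHIR coding scheme and its values may differ:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodedDate:
    """Interval coding of an uncertain date (years; negative = BCE)."""
    post_quem: Optional[int] = None  # earliest possible year
    ante_quem: Optional[int] = None  # latest possible year

    @property
    def precise(self) -> Optional[int]:
        """A date is precise when both bounds coincide."""
        return self.post_quem if self.post_quem == self.ante_quem else None

# Hand-coded examples mirroring some rows of Table 1 (the numeric
# bounds chosen here are illustrative, not the project's coded values):
datable_to_392 = CodedDate(392, 392)       # "datable to 392" -> precise
built_by_366   = CodedDate(ante_quem=366)  # "built by 366" -> ante quem only
late_4th_c     = CodedDate(375, 400)       # "late 4th c." -> an interval
bce_to_ce      = CodedDate(-330, 640)      # "330 BCE - 640 CE"

print(datable_to_392.precise, built_by_366.ante_quem, late_4th_c.post_quem)
```

Such an interval coding keeps the uncertainty of the source expression while still allowing quantitative queries like "all sites possibly existing in year X".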