[Preprint] The Engineering History Project Database: Creating and Linking Datasets from Structured, Semi-Structured, and Unstructured Historical Sources
Israel Solares, Edward Beatty
Publication Date
16-10-2025
License
This work is made available under a CC BY 4.0 license and should only be used in accordance with that
license.
Citation for this work (American Psychological Association 7th edition)
Solares, I., & Beatty, E. (2025).
[Preprint] The Engineering History Project Database: Creating and Liking
Datasets from Structured, Semi-Structured, and Unstructured Historical Sources
(Version 1). University of
Notre Dame. https://doi.org/10.7274/30376324.v1
This work was downloaded from CurateND, the University of Notre Dame's institutional repository.
For more information about this work, to report an issue, or to preserve and share your original work,
please contact the CurateND team for assistance at curate@nd.edu.
The Engineering History Project Database: Creating and Linking Datasets
from Structured, Semi-Structured, and Unstructured Historical Sources
[Preprint] October 2025
Israel G. Solares, UNAM (Mexico), israel.garcia@iimas.unam.mx
Edward Beatty, University of Notre Dame, ebeatty@nd.edu
Abstract
Using three different types of digitized historical sources (one containing structured
information, one semi-structured, and one unstructured), we construct a relational
database that connects individuals, firms, and textual material related to individuals and
firms. The research project examines the emergence of professional engineering, 1870-1930,
and uses the global mining sector as a case study. This paper explains the methods
used to construct the initial three constituent datasets, including techniques to clean and
validate each. It then explains the methods used to transform and link those datasets,
creating a relational database that includes information on roughly 130,000 individuals,
over 50,000 firms, and almost 400,000 journal articles. We are able to trace individuals,
firms, and technologies over time and space and identify interconnected communities and
networks in a globalized setting. This is a preprint version.
KEYWORDS: History of technology; engineering; qualitative historical data; digital
methods; digital humanities; OCR; workflow; data linkage; global mining history.
Introduction
We live in a world fully engineered, the origins of which lie in the relatively recent past.
Engineering emerged and expanded as a modern profession between 1870 and the 1930s.
Although technical training schools have deeper historical roots in places like Germany
and France, the major institutions of the modern profession (university-based degree
programs, professional associations, trade journals, and employment in large
organizations, both public and private) proliferated globally in the late nineteenth
century and the early twentieth. By the 1930s, engineering had quickly become the
largest, the fastest growing, and arguably the most globally mobile profession in the
world. University degree programs and other technical training initiatives could be found
in countries on every continent and formed the basis for engineers’ claim to expertise.
Professional associations attracted thousands of members at local, regional, and national
levels and proliferated globally. Technical journals regularly published the results of
research, field experience, and commercial activity across engineering subfields,
distributed technical knowledge across national boundaries and facilitated knowledge
exchange. And tens of thousands of individual engineers, most holding university
engineering degrees, sought work in private corporations and public agencies. Engineers
had become increasingly critical to twentieth century histories of technical knowledge,
economic growth, organizational planning, and technocratic governance (Beatty &
Solares, 2025).
This paper introduces three new linked datasets that enable us to see the highly
mobile and relational nature of the engineering profession during its formative first half
century. We detail the methods used to create each dataset, which are applicable to a
wide range of similar kinds of historical sources. The first dataset focuses on individuals,
built from the “structured” information published in iterative editions of university alumni
and professional association membership lists. It allows us to link individual engineers at
different times of their careers and build professional career biographies. The second
dataset focuses on firms, drawn from the “semi-structured” information in corporate
directories, each re-issued multiple times and covering large swaths of the global mining
and metallurgical sector, including financial, management, and technical information on
thousands of specific firms employing engineers. The third dataset focuses on the dense
prose content of trade journals. It incorporates and organizes “unstructured” text from
the Engineering and Mining Journal, which had the largest international circulation at the
time, including each published issue over fifty years.
Linking records across these three very different types of historical sources enables
us to map and analyze the global mobility, relationships, and transnational networks
established by engineers in their schooling, employment, and publishing. Linkage of
records has been central in many studies on historical demography, allowing researchers
to create longitudinal panels to evaluate hypotheses on an individual level (Abramitzky et
al., 2020, 2024; Hedefalk et al., 2015; Ruggles et al., 2018). Yet, rarely have these datasets
included several kinds of sources, and linkages are typically made within the same
administrative producer (usually, for example, a population census). In contrast, we
highlight the utility of linking across the three datasets. This allows us to identify and
examine patterns that transcend national and organizational spaces, highlighting the
mobility of individuals and the connections between them as well as with the
organizations that produced and employed them. Although these datasets focus primarily
on the mining and metallurgical sector between 1870 and 1930, they can be expanded to
include other engineering fields and economic sectors as well. Our methods can be
applied to a wide range of similarly structured, semi-structured, and unstructured
historical sources. The datasets are open-access and available to other researchers via our
website and in long term repositories.1
This is not the first large-N, data-centric effort to examine aspects of modern
engineering history. In the mining sector, for example, Kathleen Ochs collected career
data on engineering graduates from the Colorado School of Mines, Duncan Money has
built a dataset using membership lists for the leading professional association in the
copper sector, and Marco Bertilorenzi has a dataset of graduates of France’s Saint Etienne
School of Mines. In addition, William Maloney and Felipe Valencia Caicedo have collated
data on twentieth century engineering numbers across countries in the Western
Hemisphere (Maloney & Valencia Caicedo, 2022; Money, 2022; Ochs, 1992). Other
scholars have noted the importance of international networks among professional mining
engineers (and other technical experts). See, for example, Stephen Tuffnell on networks in
the gold mining sector, and David Pretel and Lino Camprubi on global networks of experts
in the history of technology (Pretel & Camprubí, 2018; Tuffnell, 2018). More recently,
Bamboo Ren et al. have used student and university employee records to
assess trends in Chinese academe in the first half of the twentieth century (Campbell &
Lee, 2020; Ren et al., 2020). The sources and database construction methods described in
this article represent a significant departure from these efforts in (a) the scale of the data
collection, (b) the techniques used to collect, code, clean, and validate the data, and
(c) our ability to link data across historical sources.
Our datasets have several unique features. First, we utilize sources produced by
different types of organizations, with different objectives, and exhibiting very different
internal characteristics. These include structured, semi-structured, and unstructured
qualitative historical evidence. Second, we present methods used to extract, code, clean,
and validate data across these different types of sources. Our methods can be utilized by
researchers working with similar types of historical evidence on other topics. Third and
most importantly, we present our methods for linking individuals across time, between
organizations, and across space. The methods we describe below include the
simplification and standardization of names of individuals and firms to maximize
opportunities for cross-source and cross-temporal linking, and disambiguation
algorithms to avoid false links. These linkages reveal a degree of mobility and
relationship that is typically invisible (or at best anecdotal) in conventional studies based
on organization-centered archival data. Creating and then linking large datasets opens
substantial new opportunities for research, interpretation, and the identification of
large-scale temporal and spatial patterns via text mining, data visualization, mapping, and
network analysis.
1 The datasets can be located and downloaded through the Notre Dame digital repository, CurateND, and
cited as follows: Israel G. Solares and Edward Beatty (2025). Engineering History Project Dataset (Version
v.1) [Dataset]. CurateND. https://doi.org/10.7274/30108082.
Methods: Constructing the Research Data
Description and Features of the Historical Sources
Mining and metallurgical engineers constitute our case study. Mining engineering was the
first of the new fields of modern engineering to emerge as a distinct profession, following
military and civil engineering. In Europe, the Freiberg School, and in the Americas, the
Royal Mining College in Mexico City, both have eighteenth century roots. But technical
training programs everywhere exploded in the late nineteenth century, especially in the
United States, where over a dozen new university programs in mining and metallurgy
appeared, from MIT (1861) to the Colorado School of Mines (1874) (see appendix for full
list). At the same time, the American Institute of Mining Engineers (AIME) broke from the
long informal field of civil engineering in 1871. The Engineering and Mining Journal (1866),
published in New York, and the International Mining Manual (1886), published in London,
were already well-established authorities on mineral prices, company directories, and
technical information when the Wall Street Journal (1889) and the Financial Times (1888)
began their long print runs. Demand for gold, silver, and industrial metals drove a
globalization of mining expertise in the nineteenth and early twentieth centuries that
foreshadowed subsequent experiences in electrification, hydro-engineering, petroleum,
commercial agronomy, and development projects (for example). University degree
programs, professional associations, and journals of mining engineers were, arguably, the
first artifacts of globalized markets for knowledge and expertise in the twentieth century.
Globalizing engineering institutions along with the relationships, linkages, and
networks that engineers established were essential characteristics of what was, arguably,
the most impactful profession of the twentieth century. Engineers and engineering,
however, have been understudied and undervalued by historians. Historians’ ability to see
this global dynamic has long been impaired by a reliance on conventional sources,
produced by engineering institutions firmly rooted in local and national settings. Large-N
linked datasets can offer new perspectives and insights.
Details on our sources and annual coverage in each category can be found in the
on-line appendix to this paper. Each of the three types of historical sources we utilize has
distinctive content and structural characteristics and yields a distinct dataset. We
subsequently link the three into one relational database. A description of sources for each
dataset follows; Table 1 lists the field content of each and notes cell and record counts.
I. A “Structured” dataset that focuses on individuals, drawn from school alumni and
professional association membership lists. These sources are centered around individuals,
showing their names, affiliation and, in many cases, employment status and location. They
are published at regular intervals, incorporating new information on subsequent
employment and locations of graduates and members. They take the form of a single
column list, which implies a repetitive and structured presentation of data. Our list of
schools includes all major mining and metallurgical engineering degree-granting programs
in the United States, with nearly comprehensive coverage from 1870-1915, as well as
several major school programs in Mexico, England, Germany, and France. The association
data comes from the American Institute of Mining Engineers (AIME), covering 1873-1912, for a
long time the biggest engineering association in the United States.
II. A “Semi-structured” dataset that focuses on firms, drawn from corporate
directories. Published irregularly but repeatedly, these sources compile information on
firms in the mining and metallurgical sector. We rely heavily on two directories published
in New York and London, respectively. The information they contain takes the form of a
long single column list that includes some repetitive and structured data (such as a list of
top managers, often in multi-column form), semi-repetitive semi-structured data (such as
a narrative with the main characteristics of the firm) as well as nonrepetitive unstructured
data (such as notes on the particular characteristics of the firm). The directories include all
significant mining and metallurgical firms in the Anglo-American sphere, operating in
mining districts around the world.
III. An “Unstructured” dataset that focuses on the content of trade journals. This
source takes the form of text in independent informative pieces, long articles, images, and
advertisements in a multiple column format. The information, therefore, is highly
segmented, nonrepetitive and largely unstructured. The Engineering and Mining Journal
(covered 1871-1923) was the leading journal in mining and a global reference for
technology, prices, and corporate news in the sector; it covered industry developments
around the world.
The data available in the historical sources permit the extraction of the following
information:
Table 1: Potential Field Content of the Three Datasets
Structured Dataset     | Semi-structured Dataset | Unstructured Dataset
-----------------------|-------------------------|---------------------
ID observation         | ID observation          | ID observation
Type of source         | Type of source          | Volume
Source                 | Source                  | Page
Page pdf               | Original text           | Original text
Original text          | Simplified text         | Type of text
Year reference         | Year of Source          | Year
Last name              | Last name               | Date
First name             | First name              | Day
Middle name            | Middle name             | Month
Simplified name        | Simplified name         | Author
Position               | Position                | Title
Organization           | Organization            | Subtitle
Location               | Location                | Location
Longitude (attributed) | Longitude (attributed)  |
Latitude (attributed)  | Latitude (attributed)   |
Nationality            | Capital                 | Country
Education              | Minerals                |
                       | Output                  |
                       | Technology              |
-----------------------|-------------------------|---------------------
17 variables           | 19 variables            | 14 variables
316,672 observations   | 611,291 observations    | 506,138 observations
129,599 Persons        | 52,792 Firms            | 396,961 Articles
NOTES: “Organization” refers to either university or employer; “Education” refers to
school attended; “Year reference” refers to the year of the source. Observations
correspond to each mention of an individual in the sources (thus the same individual may
have multiple observations), to each entry for a particular firm, and (in the Unstructured
dataset) to recognizable text, linking date, title, and author with the text of an entry or
article. See below on the relational structure of these three datasets. Cell and record
counts are as of June 2025.
The digitized historical documents for each type of source were downloaded
through HathiTrust. The different structural characteristics of the sources required
different methods of information mining and data cleaning. The next three sections detail
the methods and workflows we utilized to extract, clean, and organize each type of
qualitative data, before turning to our validation protocol, linking methods, and results.
Structured Data
Structured data is common in published historical sources such as annual or occasional
lists of students, alumni, or association members. Figure 1 presents our workflow for
constructing a dataset from structured data. Extraction begins using OCR provided with
the original download of a digitized historical source from HathiTrust, or, if no OCR was
provided or it was of particularly poor quality, obtained using Adobe. The next step involves
segmentation of the text between observations, which in the individuals dataset means
differentiating between the units of information: individuals and the information
pertaining to them. We then classify and clean the information to be able to identify the
different categories present in each observation. Both segmentation and classification
used regular expressions in tidyverse (Ooms, 2019; Silge & Robinson, 2016; Wickham &
Wickham, 2017; Yarberry, 2021). A combination of OpenRefine (Miller & Vielfaure, 2022),
manual edits and the use of the Virtual International Authority File (VIAF) authority was
used to search for individuals and to choose the best name match if present in the
dataset. Geoparsing from locational data followed, before passing a visual examination
test where common mistakes were identified, before repeating the process (Petrova-
Antonova & Tancheva, 2020). The structured datasets were then validated and compiled
into the database, before conducting a general validation of the Structured dataset.
Figure 1: Sequence of Operations for Structured Data
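The segmentation and classification steps can be illustrated with a minimal sketch. The authors worked with regular expressions in R; the Python version below is only an illustration of the general idea, and the sample text, the entry layout, the patterns, and the field names are all hypothetical assumptions rather than the project's actual rules. It splits an OCR text layer into one observation per individual, keeps a row ID for traceability, and tags the fields inside each observation.

```python
import re

# Hypothetical OCR text from an alumni list (assumed layout: one entry per
# individual, surname in capitals, continuation lines indented).
ocr_text = """\
ADAMS, John Q., Supt., Example Mining Co., Denver, Colo.
BAKER, W. H., Assayer, Sample Smelting Co.,
    Leadville, Colo.
CLARK, Thomas, Mining Engineer, Butte, Mont.
"""

# Segmentation: a new observation starts at a line that begins with an
# upper-case surname followed by a comma.
ENTRY_START = re.compile(r"^(?=[A-Z][A-Z'\-]+,)", flags=re.MULTILINE)

# Classification patterns (illustrative vocabularies only).
ENTRY = re.compile(r"^(?P<last_name>[A-Z][A-Z'\-]+),\s*(?P<given>[^,]+),\s*(?P<rest>.+)$")
POSITION = re.compile(r"^(Supt\.|Assayer|Mining Engineer|Manager)$")
ORGANIZATION = re.compile(r"\b(Co\.|Company|Works|Mine)$")


def segment(text):
    """Split the text layer into one string per observation."""
    chunks = [c for c in ENTRY_START.split(text) if c.strip()]
    # Join continuation lines and collapse runs of whitespace.
    return [re.sub(r"\s+", " ", c).strip() for c in chunks]


def classify(row_id, entry):
    """Tag the fields of one observation, keeping the row ID and original text."""
    record = {"id": row_id, "original_text": entry}
    m = ENTRY.match(entry)
    if m is None:
        return record  # left unclassified for manual review
    record["last_name"] = m["last_name"].title()
    record["given"] = m["given"].strip()
    location = []
    for part in (p.strip() for p in m["rest"].split(",")):
        if POSITION.match(part):
            record["position"] = part
        elif ORGANIZATION.search(part):
            record["organization"] = part
        elif part:
            location.append(part)
    record["location"] = ", ".join(location)
    return record


rows = [classify(i, e) for i, e in enumerate(segment(ocr_text), start=1)]
for row in rows:
    print(row)
```

Entries that do not match the expected pattern keep only their ID and original text, which mirrors the idea of tracing every row back to the source for later visual examination.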
Units as rows of text are the original form of extraction for the database. The
classification process then identifies the type of information within each unit through
regular expressions, but keeping a unique row ID (which allows us to trace back the data
mining process). Name parsing for individuals was elaborated through a simplification of
the name. Names appeared in the sources in several forms, from full names with titles to
abbreviations with last names and initials. Some of them had different orthographies
(such as Mc or Mac, or elimination of "von" or "de"), which further complicated the
parsing. We simplified the orthographies using regular expressions in stringr and reduced
the structure to last name and initials to consolidate the variations.2
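A minimal sketch of this kind of name simplification follows. It is written in Python rather than the authors' R/stringr, and the particle list, the title list, and the Mac/Mc rule are assumptions chosen for illustration; the project's actual criteria are in the scripts cited in the footnote.

```python
import re
import unicodedata

# Hypothetical lists of particles and titles to drop before comparison.
PARTICLES = re.compile(r"\b(von|van|de|del|la|le)\b\s*", flags=re.IGNORECASE)
TITLES = re.compile(r"\b(prof|dr|mr|sir|col|capt)\.?\s+", flags=re.IGNORECASE)


def ascii_fold(text):
    """Strip accents so that 'Müller' and 'Muller' consolidate."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()


def simplify_name(last_name, given_names):
    """Reduce a name to 'lastname initials', for example 'mcdonald j a'."""
    last = ascii_fold(last_name).lower()
    last = PARTICLES.sub("", last)
    # Treat 'Mac' + at least three letters as 'Mc'. This is a crude rule and
    # will mis-handle some surnames; it is only meant to show the approach.
    last = re.sub(r"^mac(?=[a-z]{3,})", "mc", last)
    last = re.sub(r"[^a-z]", "", last)
    given = TITLES.sub("", ascii_fold(given_names))
    initials = [word[0].lower() for word in re.findall(r"[A-Za-z]+", given)]
    return " ".join([last] + initials)


print(simplify_name("MacDonald", "John A."))
print(simplify_name("McDonald", "J. A."))
print(simplify_name("von Humboldt", "Alexander"))
```

Reducing given names to initials trades precision for recall: it lets full names and abbreviated forms meet on one key, which is why a separate disambiguation step is still needed to avoid false links.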
Like similar datasets focused on firms and individuals, all three of our datasets can
be described as unbalanced data panels with a long or stacked structure. The voluntary
reporting, the transformations of corporate names and government agencies, and the
continuous emergence of new actors (individuals, universities, and corporations, among
others) in the mining world make the construction of a comprehensive dataset
impossible.3 Nevertheless, we have no reason to suspect that missing observations are
nonrandom, as incentives to report did not change during the period. Figure 2 shows the
temporal density of individuals in the database, reflecting the rapid growth of engineering
between the 1870s and 1930s. Our Structured Dataset contains 315,141 observations of 9
variables, corresponding to 129,599 unique individuals. It is difficult to know the gender
composition from the dataset, but we used the genderdata package in R to make
historical predictions of sex assigned at birth (Blevins & Mullen, 2015; Mihaljević et al.,
2019). We predicted the sex assigned at birth of 70,834 individuals, almost 55% of the
sample; of these, about 2.6% might have been female. The geographical distribution of
individuals is heavily concentrated in the North Atlantic countries, although there is a
significant presence of work locations in mining regions across all continents (see Figure
3). We can identify 242,629 geolocations in the individual-centered dataset,
corresponding to 23,644 distinct locations for 98,501 different individuals.
2 R scripts and CSV files are available at (solaresig, 2020/2024).
3 See some of the implications of missing data in panel analysis with similar datasets (individuals that self
report in waves and firms in the market) in (Baltagi & Chang, 1994; Baltagi & Song, 2006; Young & Johnson,
2015).
Figure 2: Total number of individuals in the Structured dataset
Figure 3: Geolocation of individuals in the Structured dataset
Semi structured Data
Firms in the mining sector constitute the focus of the second dataset, drawn from
repeated editions of corporate directories. Using these sources involves a combination of
structured and unstructured data. Our first experiment with managing this data was with
the Mexican Mining Manual, in which we used a greater amount of visual classification
and parsing. In contrast, extracting the information from the more substantial corporate
directories published in New York and London (MYB and AMM) between 1887 and 1923
used minimal visual classification. In these, the structured section of the data typically
included the company's name and list of directors, but the rest of the information was
largely unstructured and inconsistent across observations. These two sets of
information required the same steps as the structured data in the individuals dataset but
with a higher level of complexity to consistently establish the character and naming of
firms (Figure 4).
Figure 4: Sequence of Operations for Semi-Structured Data
After classifying and cleaning the structured section of the information, we searched for
names and locations already present in the dataset, and used regular expressions to
identify the names and locations of places, such as mines, mills, and headquarters, and
to recognize output types and capital for the firms. Because firms’ self-reporting segments
this information, further searches following these methods can be made in the future. The
Named Entity Recognition for firms is more complex than the cleaning of individual names
in the structured dataset. Name parsing for organizations is more challenging because
organizations may change their name, consolidate or become absorbed by others, or
simply register as a subsidiary of another firm on occasion (Song et al., 2020). Naming
authorities contain a small amount of information for these entities, as thousands of firms
appeared and disappeared relatively quickly in the mining sector. Consequently, name
parsing involved a combination of cleaning through regular expressions in R using the
stringr package, searching the Virtual International Authority File (VIAF) database, and the
use of OpenRefine (Bianchini et al., 2021; Schneider, n.d.).4 The next step was a simple
geoparsing based on the locational data produced by the classification.
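The regular-expression part of firm name cleaning can be sketched as follows. This is a Python illustration, not the authors' R code: the list of legal-form suffixes and the example firm names are hypothetical, and the sketch only groups spelling variants. Renames, mergers, and subsidiaries, which the text notes required authority files and OpenRefine, are outside its reach.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form words to drop when building a matching key.
LEGAL_FORMS = re.compile(
    r"\b(company|co|corporation|corp|incorporated|inc|limited|ltd)\b\.?",
    flags=re.IGNORECASE,
)


def simplify_firm(name):
    """Build a simplified key from a firm name as printed in a directory."""
    key = name.lower().replace("&", " and ")
    key = re.sub(r"^\s*the\s+", "", key)      # drop a leading article
    key = LEGAL_FORMS.sub(" ", key)           # drop legal-form suffixes
    key = re.sub(r"[^a-z0-9 ]", " ", key)     # drop remaining punctuation
    return re.sub(r"\s+", " ", key).strip()


def group_variants(names):
    """Group printed name variants that share the same simplified key."""
    groups = defaultdict(list)
    for name in names:
        groups[simplify_firm(name)].append(name)
    return dict(groups)


variants = [
    "The Example Consolidated Mining Co.",
    "Example Consolidated Mining Company, Limited",
    "Sample Smelting & Refining Co.",
]
for key, members in group_variants(variants).items():
    print(key, "<-", members)
```

Keeping the original printed name alongside the simplified key, as the datasets do with their "Original text" and "Simplified name" fields, makes it possible to audit each grouping afterwards.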
The Semi-structured Dataset contains 611,291 observations of 12 variables,
corresponding to 52,792 unique organizations. Figure 5 shows the temporal density of
companies, and Figure 6 shows the 5,510 distinct locations present. Significant
clustering occurs in London, New York, and Boston where the headquarters of many firms
are listed.
4 See the files corpis.R and combo.R in our on-line appendix for the specific criteria.
Figure 5. Total number of organizations in the Semi structured dataset.
Figure 6: Geolocation of organizations in the Semi structured dataset
Unstructured Data
Unstructured qualitative data in historical sources present a number of challenges, but
recent developments in pattern recognition have allowed researchers to identify entities
in semi structured and unstructured data, with the help of automatic methods using
machine learning (Lafia et al., 2023; Liu et al., 2022). Our work with the Engineering and
Mining Journal (EMJ) to create the Unstructured Dataset faced the following problems:
a. Character recognition. Complex layouts on the digitized page, composed of
multiple columns, headers, pictures and sections, often produce faulty and
incomplete character recognition. A picture might be interpreted as text or text as
an image, creating complex patterns in the resulting text layer (Blomqvist et al.,
2023).
b. Segmentation. Articles are pieces of information with internal coherence,
but the breaks between them do not follow repetitive patterns embedded in the
text. In other words, whereas the alumni lists, membership publications, and
corporate directories had a recognizable structure to separate the pieces of
information, the journal (like most newsprint formatting) does not provide one.
Section and subsection headings may be inconsistent, and single articles often
cross column or page boundaries. This implies that, even if the pieces of text are
recognizable by the OCR engine, it cannot easily make the proper distinctions
between the pieces of information that are the component elements of the journal
(Likforman-Sulem et al., 2007).
c. Classification. While vital information, such as location, names, and
employment, had a clear and repetitive structure in the other two datasets, the
unstructured nature of the journal implies that extracting this information might
face even greater challenges regarding entity recognition and disambiguation. We
therefore do not extract such information in specific fields, and the information
has to be acquired through ad hoc searches within the content of the “Text” field
of the resulting dataset (Bravo Balado et al., 2015).
d. Cleaning. Because classification of information is complex, it is also very
difficult to identify and remove mistakes, particularly in longer pieces of text. This
can be done solely in the structured parts of the article, such as author names
when they are present (Christen, 2012).
Image segmentation with a computer vision method was used to overcome these
limitations, although some remain in the final form of the database. Figure 7 outlines the
sequence and logic of the workflow process.
Figure 7: Sequence of Operations for Unstructured Data
The first step of the process was labeling. We constructed two initial datasets for
training and testing, and these were labeled by a research assistant following the
examples of a first test and using the process proposed by Label Studio (Lin et al., 2014;
Shen et al., 2021; Tkachenko et al., 2022; Welcome to Detectron2’s Documentation!
Detectron2 0.6 Documentation, n.d.). As the labeling was made on multiple computers,
the IDs of the COCO annotations are inconsistent, so consolidation required homogenizing
the labeling IDs. We also double-checked the labeling tasks, eliminating ambiguities in the
images labeled to feed into the training process. We included only five categories in the
labeling (page number, date, title, author, text), excluding page headers, images, captions
for images, and tables, as their information is often difficult to extract.
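The consolidation of labeling IDs can be sketched as below. The dictionary layout follows the general COCO annotation structure (images, annotations, categories); the merge function, its name, and the toy exports are hypothetical, and the sketch assumes that category names, unlike numeric IDs, were consistent across machines.

```python
# The five label categories named in the text, in a fixed canonical order.
CATEGORIES = ["page number", "date", "title", "author", "text"]


def merge_coco(exports):
    """Merge several COCO-style exports, re-assigning all IDs consistently."""
    category_id = {name: i for i, name in enumerate(CATEGORIES)}
    merged = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in category_id.items()],
    }
    for export in exports:
        # Category and image IDs are only meaningful inside one export.
        local_names = {c["id"]: c["name"] for c in export["categories"]}
        image_map = {}
        for image in export["images"]:
            new_id = len(merged["images"])
            image_map[image["id"]] = new_id
            merged["images"].append({**image, "id": new_id})
        for ann in export["annotations"]:
            name = local_names[ann["category_id"]]
            if name not in category_id:
                continue  # skip labels outside the five categories
            merged["annotations"].append({
                **ann,
                "id": len(merged["annotations"]),
                "image_id": image_map[ann["image_id"]],
                "category_id": category_id[name],
            })
    return merged


# Two toy exports whose numeric IDs clash and whose category order differs.
export_a = {
    "images": [{"id": 0, "file_name": "page_a.png"}],
    "annotations": [{"id": 0, "image_id": 0, "category_id": 1, "bbox": [10, 10, 200, 30]}],
    "categories": [{"id": 0, "name": "text"}, {"id": 1, "name": "title"}],
}
export_b = {
    "images": [{"id": 0, "file_name": "page_b.png"}],
    "annotations": [{"id": 0, "image_id": 0, "category_id": 1, "bbox": [10, 50, 200, 400]}],
    "categories": [{"id": 0, "name": "title"}, {"id": 1, "name": "text"}],
}
merged = merge_coco([export_a, export_b])
print(len(merged["images"]), len(merged["annotations"]))
```

Mapping categories by name rather than by number is the design choice that matters here: the same numeric ID meant "title" on one machine and "text" on the other in the toy data.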
After the labeling was complete, the dataset was randomly divided into training
and validation at a split ratio of 0.85.5 Once the model was trained, its weights
were used to predict the layout of the pages in all volumes of EMJ (1870-1930), totaling
84,287 pages. The prediction result was divided into pages and segments of the image, to
which the Tesseract OCR package was applied. In short, each piece of information on the
page has locational data (x and y coordinates) and associated text. The next step was to
assign a hierarchy based on the locational data on each page, to rebuild the pieces of
information that might have been separated (paragraphs of the same article in different
columns, or interrupted by an image, etc.). After this, a new text segmentation based on
regular expressions was made to minimize the errors in the articles. In this step, we added
the category “subtitle” to account for pieces of information that have internal coherence
between them, even if titles separate them. We were more concerned with false matches
(adding pieces of information with no relationship between them), so this step was
logically very restrictive.
5 We used a 0.85 ratio as suggested for the training of the layout parser model by Shannon Shen (Layout-
Parser/Examples/Customizing Layout Models with Label Studio Annotation/Customizing Layout Models with
Label Studio Annotation.Ipynb at Main · Layout-Parser/Layout-Parser · GitHub, n.d.).
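The idea of assigning a hierarchy from locational data can be sketched as a reading-order sort followed by a conservative merge. The block structure, the fixed column count, and the merging rule below are assumptions made for illustration; the sketch reads columns left to right and top to bottom, and attaches text only to the most recent title.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str    # predicted label, e.g. "title" or "text"
    x: float     # left edge of the segment on the page image
    y: float     # top edge of the segment on the page image
    text: str    # OCR output for the segment


def reading_order(blocks, page_width, n_columns=2):
    """Sort blocks column by column, then top to bottom within a column."""
    column_width = page_width / n_columns

    def key(block):
        column = min(int(block.x // column_width), n_columns - 1)
        return (column, block.y)

    return sorted(blocks, key=key)


def rebuild_articles(blocks, page_width, n_columns=2):
    """Start a new article at each title; append text to the current one."""
    articles = []
    for block in reading_order(blocks, page_width, n_columns):
        if block.kind == "title":
            articles.append({"title": block.text, "text": ""})
        elif block.kind == "text":
            if not articles:
                articles.append({"title": "", "text": ""})
            current = articles[-1]
            current["text"] = (current["text"] + " " + block.text).strip()
    return articles


# A toy two-column page: article A continues at the top of the second column.
page = [
    Block("text", 550, 450, "B1"),
    Block("title", 50, 100, "A"),
    Block("text", 550, 100, "A2"),
    Block("title", 550, 400, "B"),
    Block("text", 50, 150, "A1"),
]
articles = rebuild_articles(page, page_width=1000)
print(articles)
```

A real page needs more than this (variable column counts, full-width headlines, articles that continue on the next page), which is consistent with the text's remark that some limitations remain in the final database.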
The Unstructured Dataset comprises 506,138 observations across 19 variables,
corresponding to 396,961 unique articles spanning over 40 years of the journal. Figure 8
shows the density of articles in the database. As can be seen, the curated nature of trade
news makes the accumulated information smoother than individuals and corporations,
but a similar temporal growth pattern is evident. We can identify and geolocate 246,495
pieces in the journal dataset, corresponding to 2,887 unique locations (Figure 9).6 The
geographic distribution of places mentioned in articles reinforces the U.S. and European
orientation of the journal and the industry, but also illustrates its globally dispersed
interests.
Figure 8: Total number of articles in the Unstructured dataset.
Figure 9: Geolocation of articles in the Unstructured dataset
6 Using the Google API for geolocation, the ggmap package and the countrycode package, both in R, for
mapping; see (Arel-Bundock et al., 2018; Kahle & Wickham, 2013; South, 2011).
Validation
One of the main challenges in data linkage is validating data quality in the linked datasets,
where errors are the product of problems in record transcription, digitization, and linking
algorithms (Antonie et al., 2020; Bailey et al., 2020, 2023; Wisselgren et al., 2014). One of the
advantages of linkage within a professional community, over a relatively short period of
time, is that the risks of false positives in linked records are greatly reduced, compared to
the linkage across census years, for example. Nevertheless, validation of the data was an
integral part of producing each of the three datasets.
For the information in the Structured Dataset, there are four levels of possible
errors: errors in the extraction process through the OCR; errors in the cleaning process;
errors in classification between different kinds of information; and errors in parsing, both
for individuals and organizations. We conducted three assessments of each type of
mistake present in the database, measuring the most common errors and the overall
accuracy of the methodology. We separated the text-centered dataset into five categories
according to the quantity of data. Then, a randomized representative sample of the rows
of the database, drawn at a 95% confidence level, was evaluated by a researcher, who
compared them with the original sources and made informed decisions about the mistakes
in the process. The results are in Table 2.
Table 2. Validation of Representative Samples of the Structured Dataset
Field                | ALUMNI  | AIME
---------------------|---------|--------
N                    | 384     | 382
CAcc                 | 97.70%  | 99.00%
WAcc                 |         |
Company/Organization | 72.60%  | 89.70%
Year Regist          | 80.00%  |
Location             |         | 92.70%
Lastname             | 94.60%  | 86.60%
Firstname            | 93.90%  |
Middlename           | 67.00%  |
Position             | 76.80%  | 95.60%
Nationality          |         | 99.50%
Education            |         | 100.00%
Overall Accuracy     | 80.80%  | 94.00%
Notes: CAcc (Character Accuracy Rate, or 100% minus CER), WAcc (Word Accuracy Rate)
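As a hedged illustration of the two measures named in the notes, the sketch below computes character and word accuracy from the Levenshtein edit distance between a reference transcription and the extracted text. These are the standard definitions (CAcc = 1 - CER, WAcc = 1 - WER); the text does not detail how the project computed them per field, and the function names and the sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(reference, extracted):
    """CAcc = 1 - CER, with CER = character edits / reference length."""
    return 1 - edit_distance(reference, extracted) / len(reference)


def word_accuracy(reference, extracted):
    """WAcc = 1 - WER, computed over whitespace-separated tokens."""
    ref_words = reference.split()
    return 1 - edit_distance(ref_words, extracted.split()) / len(ref_words)


# Hypothetical OCR slip: "m" read as "rn" in the last word.
reference = "Anaconda Copper Mining Company"
extracted = "Anaconda Copper Mining Cornpany"

print(f"CAcc = {char_accuracy(reference, extracted):.2%}")
print(f"WAcc = {word_accuracy(reference, extracted):.2%}")
```

A single two-character slip costs little at the character level but invalidates a whole word, which is one reason WAcc values for name fields tend to sit well below the CAcc of the same source.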
Although the overall accuracy was 87.39%, the relatively lower scores for company
name, position, and middle name motivated a general revision of the cleaning,
classifying, and parsing steps for the complete dataset.
Verification of the Semi structured Dataset began by integrating the extracted
information for another round of cleaning and testing, looking for the same kinds of
mistakes described above. We applied the analysis to the consolidated data; the results
are available in Table 3.
Table 3. Validation of Representative Samples of the Semi structured Dataset
                            MYB       AMM       MMM       OVERALL
N                           409       384       368       1161
CAcc                        98.6%     99.5%     99.8%     99.3%
WAcc
  Company/Organization      43.8%     73.8%     92.7%     69.2%
  Capital                   99.3%     98.8%     96.8%     98.3%
  Year Regist               99.3%     98.3%     99.8%     99.1%
  Location                  75.8%     94.4%     89.0%     86.1%
  Last name                 88.8%     83.1%     74.1%     82.2%
  First name                94.1%     97.6%     94.6%     95.4%
  Middle name               91.7%     87.8%     76.5%     85.6%
  Role                      96.6%     98.3%     96.3%     97.1%
Overall                     86.2%     91.5%     90.0%     89.1%
Notes: CAcc (Character Accuracy Rate), WAcc (Word Accuracy Rate). MYB = Mining Year
Book; AMM = American Mining Manual; MMM = Mexican Mining Manual.
The overall accuracy was 89.1%. However, as with the structured data, the
disparity of results with the original sources (especially regarding company names)
motivated a general revision of the cleaning, classifying, and parsing steps in the complete
dataset. Again, a randomized representative sample was completed at a 95% confidence
level. The accuracy results for the new datasets, for both individuals and firms, are
presented in Table 4.7
Table 4. Second validation of representative sample (WAcc)
Item               Individuals Dataset   Firms Dataset   Average
Last name          98.02%                86.36%          94.01%
Middle name        98.02%                93.18%          96.35%
First name         97.62%                93.18%          96.09%
Simplified name    96.03%                81.06%          90.89%
Education          100.00%               99.24%          99.74%
Organization       100.00%               84.09%          94.53%
Location           91.67%                98.48%          94.01%
Average            97.34%                90.80%          95.09%
7 See the files in the replicability package at (solaresig, 2020/2024).
We tested the matching process by measuring the possibility of spurious matches
between individuals in the Structured dataset and the Semi structured dataset. A
spurious match could be caused by: convergent errors in the OCR, in which the errors
randomly match on a third, not real, entity; convergent errors in the cleaning and
classification process; or coincidence in the last name, the first letter of the first name,
and the first letter of the middle name. We generated a randomized representative
sample, at a 95% confidence level, of records matched through a simplified name. A
researcher then classified each match as a true or false positive by manually examining
the sources and the contextual information. From all the above causes, false positives
represented just 1.83% of the test sample (n=382). After this process, the 2,000 most
prominent organizations in the database were visually examined and manually corrected
for the errors described above. Most of the corrections involved not the misidentification
of the company but a spelling mistake carried over from the OCR phase. No further
accuracy assessment has been made on the manually corrected datasets.
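The third cause of spurious matches, coincidence in last name and initials, can be illustrated with a minimal sketch. The exact form of the project's simplified name is not given in the text; the key below (last name plus first and middle initials) is an assumption based on the description above, and all function names and records are hypothetical.

```python
def simplified_name(last, first="", middle=""):
    """Assumed key: lowercased last name plus first and middle initials."""
    initials = "".join(part.strip()[:1] for part in (first, middle))
    return f"{last.strip().lower()}_{initials.lower()}"


def link_on_simplified_name(left, right):
    """Pair every record in `left` with records in `right` sharing its key."""
    index = {}
    for rec in right:
        index.setdefault(simplified_name(**rec), []).append(rec)
    return [(l, r) for l in left for r in index.get(simplified_name(**l), [])]


def false_positive_rate(reviewed):
    """Share of sampled links that a reviewer marked as false."""
    return sum(1 for is_true in reviewed if not is_true) / len(reviewed)


# Toy records standing in for the two datasets (invented names).
structured = [{"last": "Smith", "first": "John", "middle": "Allen"}]
semi_structured = [
    {"last": "Smith", "first": "James", "middle": "Arthur"},  # same key, other person
    {"last": "Smith", "first": "Robert", "middle": ""},
]

links = link_on_simplified_name(structured, semi_structured)
print(links)  # one link, and it is spurious: both names reduce to "smith_ja"
print(false_positive_rate([True, True, False, True]))
```

Manual review of a sample of such links, as described above, is what turns the list of candidate matches into an estimated false positive rate.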
Validation of the Unstructured data required new levels of evaluation, as the errors
could be different: errors in image segmentation; misclassification of text into articles;
misclassification of labels in the page; OCR errors; and errors in the reference system. As
Table 5 reports, accuracy was highest in detecting the boundaries of text, followed at
much lower levels by date and title. This is an expected result, as the larger objects are
text and the smaller objects are date and author. For the object detection model, the
usual metric used to assess validity is Average Precision (AP). The metric summarizes
the area under the precision-recall curve; it can also be defined as the weighted mean
of that relationship.8 The accuracy metrics greatly improved after applying regular
expressions. The classification and cleaning process focused on the structured data
provided after the segmentation: specific dates, correcting the page references, and
cleaning the authors' names. We conducted NER based on the previous two datasets,
extracting locations and corporate names from the unstructured text. Finally, a human
validated a representative sample of the information and the comprehension of the
unstructured sections to verify the consistency in segmenting the paragraphs into articles
and news reports (coherent pieces of information), along with CAcc and WAcc measures.
Table 6 shows the results of this human validation.
8 Precision is the ratio of True Positives (TP) to all detected positives (TP plus False
Positives), while recall is the ratio of TP to all actual positives (TP plus False Negatives).
Usually there is an inverse relationship between precision and recall; the precision-recall
curve traces that relationship, and the AP metric summarizes that curve.
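To make the precision, recall, and AP definitions concrete, here is a minimal sketch of the plain, non-interpolated form of Average Precision: the sum of precision values weighted by the increase in recall at each ranked detection. This is an assumption-laden simplification; object-detection toolkits typically report an interpolated variant averaged over several overlap thresholds, so the numbers it produces are not directly comparable with those in Table 5. The function names and the toy detections are hypothetical.

```python
def precision_recall_points(scores, is_true_positive, n_ground_truth):
    """Precision and recall after each detection, ranked by confidence."""
    ranked = sorted(zip(scores, is_true_positive), key=lambda pair: -pair[0])
    tp = fp = 0
    points = []
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)      # TP over all detections so far
        recall = tp / n_ground_truth    # TP over all actual objects
        points.append((precision, recall))
    return points


def average_precision(points):
    """Sum of precision weighted by the recall gained at each step."""
    ap, prev_recall = 0.0, 0.0
    for precision, recall in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Three detections for a label with two annotated objects on the page.
points = precision_recall_points(
    scores=[0.9, 0.8, 0.6],
    is_true_positive=[True, False, True],
    n_ground_truth=2,
)
ap = average_precision(points)
print(f"AP = {ap:.3f}")
```

In this toy case the false detection ranked second lowers the precision attached to the second true positive, which is how small, easily confused layout objects end up with low AP.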
Table 5. Results of the Training
Category    Average Precision
Author      22.473
Date        29.107
Page        13.324
Text        63.920
Title       26.023
General     30.969
Note: see text for explanation.
The results in Table 6 show that the categories with the most mistakes in
classification are Page and Subtitle, with fewer errors in Type and Date. We divided the
classification mistakes in the text field into two categories: the first counts the cases in
which the cell did not capture all of the information (under), and the second counts the
cases in which the cell captured data from a different title (over). As shown in Table 6,
the algorithm prioritized reducing errors that mix information. The classification
accuracy of the Text category is 94.81%, and the overall accuracy of the dataset is 93.38%.
A separate accuracy evaluation was made for character recognition and word accuracy.
We considered the whole text extracted (title, subtitle, and text) as a unit of information
and evaluated the structured information (date, page, and author) separately.
Recognition errors in the structured section of the data tend to be lower for
dates and authors, which is expected, and the low number of recognition errors in the
page field shows that most page errors came from the algorithm that corrected the page
number for all pages. As the list of authors could be matched with the list of names in
the wider dataset, the matching of names was more restrictive, not only using the
generalized name but requiring a match in three categories (last name, first name and
simplified name). In the case of the articles, we evaluated the totality of the unstructured
pieces, as most of the classification errors in the title, subtitle, and text sections came
from misidentification of the boundaries between headers and the body of text. Overall,
character errors remain around 2% and word errors around 6%.
Table 6. Validation of resulting dataset
                Classification                    Recognition
Item            Errors   Mistakes   Accuracy      Character errors   Word errors   CAcc      WAcc
Type            7        1.82%      98.18%
Page            73       18.96%     81.04%        3                  3             97.37%    92.31%
Date            2        0.52%      99.48%        30                 18            93.64%    81.82%
Author          12       3.12%      96.88%        4                  2             91.3%     75.00%
Title           35       9.09%      90.91%        12950              5991          97.66%    93.88%
Subtitle        55       14.29%     85.71%
Text (under)    12       3.12%      96.88%
Text (over)     8        2.08%      97.92%
Overall         204      6.62%      93.38%        12987              6014          97.66%    93.86%
Notes: CAcc (Character Accuracy Rate), WAcc (Word Accuracy Rate). Column 4 “accuracy”
refers to accuracy in the classification of layouts of text rather than mistakes in the text
extracted. The recognition figures on the Title row cover title, subtitle, and text together,
which were evaluated as a single unit.
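The classification percentages in Table 6 can be reproduced from the error counts if each category was checked on 385 sampled items. That sample size is an inference from the reported figures (7 errors = 1.82%), not a number stated in the text, so the short check below should be read as a consistency sketch under that assumption.

```python
SAMPLE_SIZE = 385  # assumption: inferred from 7 errors = 1.82%, not stated in the text

# Classification error counts as reported in Table 6.
classification_errors = {
    "Type": 7, "Page": 73, "Date": 2, "Author": 12,
    "Title": 35, "Subtitle": 55, "Text (under)": 12, "Text (over)": 8,
}


def error_rate(errors, n=SAMPLE_SIZE):
    """Percentage of sampled items with a classification mistake."""
    return 100 * errors / n


for item, errors in classification_errors.items():
    rate = error_rate(errors)
    print(f"{item:<13}{errors:>4}{rate:>8.2f}%{100 - rate:>8.2f}%")

# Overall: all mistakes over all (item, category) checks.
total_errors = sum(classification_errors.values())
overall_rate = 100 * total_errors / (SAMPLE_SIZE * len(classification_errors))

# Text accuracy combines the "under" and "over" mistakes.
text_accuracy = 100 - error_rate(12 + 8)

print(f"Overall mistakes: {overall_rate:.2f}%  Text accuracy: {text_accuracy:.2f}%")
```

Under this assumption the computed rates agree with the table to two decimals, including the 94.81% Text accuracy quoted in the discussion of the table.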
Linking Data: the Relational Character of the Datasets
The relational database linking the Structured, Semi structured and Unstructured datasets
was compiled in a tidy format using R software, yielding a rectangular data frame in which
variables, or columns, contain values measured in the same unit or provided in the same
data type, and observations form the rows (Wickham 2014; Silge and Robinson 2016).
Processing transformed the information from rows of text into text associated with three
types of entities: individuals, organizations and articles. In other words, from sources with
structured, semi structured and unstructured information, the classification process
produces three types of datasets: one centered around individuals (in which individuals
are the units), one centered around organizations/companies (in which companies are the
units), and one centered around text (in which the journal articles are units). The variables
contain attributes of individuals (place of origin, education, employment, etc.), firms
(capital, incorporation, technology, employees, etc.), and text content (author, location,
title, date, page, etc.).
We conducted an assessment of the linked data from the Individuals and Firms
sources by examining the matching of names across different sources. This assessment
focused on the probability of false positives or locating spurious information matches
from different sources through a common simplified name. While false negatives in
matching (by minor variations on orthography or variations in the reporting of middle
names) are possible in the database, false positivity is a more significant concern as it
might overestimate the connectivity of individuals and organizations in the dataset. In
short, type I errors (false positives) are considered more costly for this analysis than the
existence of type II errors (false negatives), and, therefore, the minimization of false
positives is privileged.
The Unstructured trade journal dataset was integrated based on these heavily
transformed Structured and Semi structured datasets. The names of the individual
authors, derived as a structured subset of the trade journal dataset, were fuzzy-matched
with existing names in the other datasets and assigned the variables Last name, Middle
name and Simplified name. We extracted the names of firms and locations recognized in
the previous steps in order to integrate the unstructured dataset with the larger
database.
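A minimal sketch of this kind of fuzzy author matching follows, using Python's standard `difflib` as a stand-in for whatever string-distance method the project used in R. The 0.85 similarity threshold, the form of the simplified name, the function names, and the sample records are all assumptions for illustration; the only element taken from the paper is the requirement that last name, first name, and simplified name all agree.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """True when two strings are close enough to absorb small OCR slips."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def simplified_name(last, first, middle=""):
    """Assumed key: lowercased last name plus first and middle initials."""
    return f"{last.lower()}_{(first[:1] + middle[:1]).lower()}"


def match_author(author, catalog, threshold=0.85):
    """Catalog entries that agree with the author on all three fields."""
    key = simplified_name(**author)
    return [
        person for person in catalog
        if similar(person["last"], author["last"], threshold)
        and similar(person["first"], author["first"], threshold)
        and similar(simplified_name(**person), key, threshold)
    ]


# Invented records; the author's last name carries an OCR slip ("e" read as "c").
author = {"last": "Whitmorc", "first": "Edwin", "middle": "P"}
catalog = [
    {"last": "Whitmore", "first": "Edwin", "middle": "Porter"},
    {"last": "Whitmore", "first": "Ernest", "middle": "Paul"},
]

matches = match_author(author, catalog)
print(matches)
```

Requiring agreement on the first name rejects the second catalog entry even though its simplified name is identical, which reflects the preference for avoiding false positives over recovering every possible link.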
Figure 10 shows the steps in the integration of the three datasets. The final dataset
integration required the creation of comparable columns across all datasets. The three
datasets had these variables in common: name of individuals, name of organizations,
longitude, latitude, year, source, and location. Nevertheless, the location column had
different levels of specificity, also reflected in the coordinates. To homogenize these
different levels, we created a locn column using the OSM dataset,9 transforming the
coordinates into a homogeneous and modern location name. Individuals, organizations,
and homogenized locations were then used as catalogs to create a relational dataset.
Figure 11 shows the final structure of the resulting relational database. The project
provides the user with the option of downloading the integrated relational database as a
MySQL file, or any one or more of the three separate datasets, and implementing their
own disambiguation algorithms.
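The catalog-based structure can be sketched as a small relational schema. The version below uses Python's built-in `sqlite3` purely as a stand-in for the MySQL file the project distributes; the table names, column names (other than `locn`), and sample rows are hypothetical and do not reproduce the entity relationship diagram in Figure 11.

```python
import sqlite3

# Three catalogs plus one table of source records that references them.
SCHEMA = """
CREATE TABLE individuals   (individual_id INTEGER PRIMARY KEY, simplified_name TEXT UNIQUE);
CREATE TABLE organizations (organization_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE locations     (location_id INTEGER PRIMARY KEY, locn TEXT UNIQUE,
                            latitude REAL, longitude REAL);
CREATE TABLE records (
    record_id       INTEGER PRIMARY KEY,
    source          TEXT NOT NULL,
    year            INTEGER,
    individual_id   INTEGER REFERENCES individuals(individual_id),
    organization_id INTEGER REFERENCES organizations(organization_id),
    location_id     INTEGER REFERENCES locations(location_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented sample rows: one person appearing in two different sources.
conn.execute("INSERT INTO individuals VALUES (1, 'smith_ja')")
conn.execute("INSERT INTO organizations VALUES (1, 'Example Mining Co.')")
conn.execute("INSERT INTO locations VALUES (1, 'Denver', 39.74, -104.99)")
conn.executemany(
    "INSERT INTO records (source, year, individual_id, organization_id, location_id) "
    "VALUES (?, ?, ?, ?, ?)",
    [("alumni", 1901, 1, None, 1), ("directory", 1907, 1, 1, 1)],
)

# Follow one individual across sources through the shared catalogs.
rows = conn.execute("""
    SELECT r.source, r.year, o.name, l.locn
    FROM records r
    JOIN individuals i ON i.individual_id = r.individual_id
    LEFT JOIN organizations o ON o.organization_id = r.organization_id
    JOIN locations l ON l.location_id = r.location_id
    WHERE i.simplified_name = 'smith_ja'
    ORDER BY r.year
""").fetchall()
print(rows)
```

Keeping individuals, organizations, and locations in separate catalogs means that a user who applies a different disambiguation rule only needs to rebuild the catalog keys, not the source records.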
Figure 10: General integration of the relational database
9 Using the rev_geocode_OSM function from the tmap library (Tennekes, 2018).
Figure 11: Entity Relationship Diagram
Conclusion
This paper has explained the methods we used to create datasets relying on very different
types of digitized historical sources, with different internal structures but with thematic
coherence. Extracting information from the sources required the use of a range of tools,
from regular expressions to machine learning. The paper also shows the different stages
of data validation necessary to constitute a linked database across the different sources
and datasets. The variety of information extracted from the historical sources and
organized across the three constituent datasets allows us to explore new dimensions of
the global emergence of professional engineering in the late nineteenth and early
twentieth centuries.
Historical work has long been constrained to archival and print sources that focus
on a particular place, organization, or event. The logistics of analog research dictated
these constraints. As a result, histories typically focused on local, organizational, or
national scales of analysis, as evidence was often produced, collected, and archived by
nation states, private organizations, or their analogs. Recently, however, access to
digitized historical sources combined with techniques to collect, process, and analyze large
corpora of historical information has dramatically opened new possibilities.
The paper demonstrates that the use of this mixed-methods approach can provide
a clearer picture of interconnected communities that produce their own narratives and
interact in globalized settings. We suggest that the linkage of different types of
thematically related datasets reveals a rich world of connectedness in the formation of
professionalized groups that produce their own records in standardized formats, but that
are not regulated by centralized agencies. These types of sources can provide a new
account of cross-border and transnational connections. We expect that the methods and
approach outlined here could be applied to a range of similar kinds of historical data in
structured, semi structured, and unstructured formats.
Acknowledgements
This project was supported by a research grant from the National Science Foundation
(Science and Technology Studies Program, #2020926) as well as seed funding from the
Kellogg Institute for International Studies at the University of Notre Dame. Research
assistance was provided by a number of Notre Dame undergraduates, including Arianna
Bordogna-Jurkowitz, Nathaniel Burke, Jack Wang, and Clark Wu. The authors also thank
Summer Mengarelli of the Navari Family Center for Digital Scholarship for her expert
counsel.
Appendix: Historical Sources List
Alumni lists
University of Arizona, 1895-1915
Camborne School of Mines (UK), 1898-1940
Colorado School of Mines, 1883-1905
Columbia University, School of Mines, 1867-1929
Escuela Nacional de Ingenieros (Mexico), 1868-1905
École Nationale des Mines (France), 1871-1922
École Nationale des Mines de Saint-Étienne (France), 1871-1922
Bergakademie Freiberg (Germany), 1766-1939
Harvard University, 1811-1919
Lehigh University, 1874-1917
Michigan College of Mines, 1888-1909
Missouri School of Mines, 1874-1915
MIT, 1861-1940
Montana School of Mines, 1903-1925
Pennsylvania State College, School of Mines, 1861-1916
Royal School of Mines (UK), 1812-1920
Stanford University, 1892-1899
University of Minnesota, Minnesota School of Mines, 1902-1924
University of Nevada, Mackay School of Mines, 1876-1917
University of New Mexico, 1890-1921
University of South Dakota, 1888-1915
University of West Virginia, 1905-1919/20
Worcester Polytechnic, 1871-1926
Yale University, Sheffield Scientific School (Mining and Metallurgy),
Total number of cells: 1,980,712
Total number of individuals: 66,591
Professional Association AIME
American Institute of Mining Engineers. [members & associates lists]
1873-1896, 1896, 1898-99, 1901-03, 1906, 1912
Total number of cells: 289,548
Corporate Directories
American Mining Manual (AMM; sometimes International Mining Manual), embracing the
principal operating metal mines, mills, smelting & refining plants of the United States,
Mexico and Canada and coal mines of the western states, Mexico and Canada. Alexander
R. Dunbar, ed. Denver: Western Mining Directory Co.: 1907, 1911, 1916, 1919, 1920, 1922
Mining Year Book (MYB) containing full particulars of mining companies. Walter R.
Skinner. London: The Financial Times: 1887-1889, 1891, 1893-1895, 1903, 1908-1923
The Mexican Magazine (Handbook of Mines). México, D.F. 1926
Total number of cells: 15,141,420
Trade Journal
Engineering and Mining Journal, all editions 1870-1930
Further detail on coding files and sources is available in our on-line appendix, found at:
https://github.com/solaresig/Blueprint-for-Modernity
References
Abramitzky, R., Boustan, L., Eriksson, K., Feigenbaum, J., & Pérez, S. (2024). Automated
Linking of Historical Data. Journal of Economic Literature.
Abramitzky, R., Mill, R., & Pérez, S. (2020). Linking individuals across historical sources: A
fully automated approach. Historical Methods, 53(2), 94-111.
https://doi.org/10.1080/01615440.2018.1543034
Arel-Bundock, V., Enevoldsen, N., & Yetman, C. J. (2018). countrycode: An R package to
convert country names and country codes. Journal of Open Source Software, 3(28),
848.
Baltagi, B. H., & Chang, Y.-J. (1994). Incomplete panels: A comparative study of alternative
estimators for the unbalanced one-way error component regression model.
Journal of Econometrics, 62(2), 67-89.
Baltagi, B. H., & Song, S. H. (2006). Unbalanced panel data: A survey. Statistical Papers,
47(4), 493-523.
Beatty, E., & Solares, I. G. (2025). Introduction: Toward an Engineered World. In An
Engineered World: The Role of Engineers in Global Modernity (pp. 1-29). MIT Press.
Bianchini, C., Bargioni, S., & di San Girolamo, C. C. P. (2021). Beyond VIAF. Information
Technology and Libraries, 40(2).
Blevins, C., & Mullen, L. (2015). Jane, John... Leslie? A Historical Method for Algorithmic
Gender Prediction. DHQ: Digital Humanities Quarterly, 9(3).
Blomqvist, C., Enflo, K., Jakobsson, A., & Åström, K. (2023). Reading the ransom:
Methodological advancements in extracting the Swedish Wealth Tax of 1571.
Explorations in Economic History, 87, 101470.
https://doi.org/10.1016/j.eeh.2022.101470
Bravo Balado, A., De Boer, V., & Schreiber, G. (2015). Linking Historical Ship Records to a
Newspaper Archive. In L. M. Aiello & D. McFarland (Eds.), Social Informatics (Vol.
8852, pp. 254-263). Springer International Publishing.
https://doi.org/10.1007/978-3-319-15168-7_32
Campbell, C., & Lee, J. (2020). Historical Chinese Microdata. 40 Years of Dataset
Construction by the Lee-Campbell Research Group. Historical Life Course Studies, 9,
130-157. https://doi.org/10.51964/hlcs9303
Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity
Resolution, and Duplicate Detection. Springer Berlin Heidelberg.
https://doi.org/10.1007/978-3-642-31164-2
Hedefalk, F., Harrie, L., & Svensson, P. (2015). Methods to Create a Longitudinal Integrated
Demographic and Geographic Database on the Micro-Level: A Case Study of Five
Swedish Rural Parishes, 1813-1914. Historical Methods: A Journal of Quantitative
and Interdisciplinary History, 48(3), 153-173.
https://doi.org/10.1080/01615440.2015.1016645
Kahle, D. J., & Wickham, H. (2013). ggmap: Spatial visualization with ggplot2. R J., 5(1),
144.
Lafia, S., Bleckley, D. A., & Alexander, J. T. (2023). Digitizing and parsing semi-structured
historical administrative documents from the G.I. Bill mortgage guarantee
program. Journal of Documentation, 79(7), 225-239.
https://doi.org/10.1108/JD-03-2023-0055
Layout-parser/examples/Customizing Layout Models with Label Studio
Annotation/Customizing Layout Models with Label Studio Annotation.ipynb at
main · Layout-Parser/layout-parser · GitHub. (n.d.). Retrieved September 2, 2025,
from
https://github.com/Layout-Parser/layout-parser/blob/main/examples/Customizing%20Layout%20Models%20with%20Label%20Studio%20Annotation/Customizing%20Layout%20Models%20with%20Label%20Studio%20Annotation.ipynb
Likforman-Sulem, L., Zahour, A., & Taconet, B. (2007). Text line segmentation of historical
documents: A survey. International Journal of Document Analysis and Recognition
(IJDAR), 9(2), 123-138. https://doi.org/10.1007/s10032-006-0023-z
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.
L. (2014). Microsoft COCO: Common objects in context. European Conference on
Computer Vision, 740-755.
Liu, Z., Wang, H., & Bol, P. K. (2022). Automatic biographical information extraction from
local gazetteers with Bi-LSTM-CRF model and BERT. International Journal of Digital
Humanities, 4(1-3), 195-212. https://doi.org/10.1007/s42803-022-00059-2
Maloney, W. F., & Valencia Caicedo, F. (2022). Engineering Growth. Journal of the
European Economic Association, 20(4), 1554-1594.
https://doi.org/10.1093/jeea/jvac014
Mihaljević, H., Tullney, M., Santamaría, L., & Steinfeldt, C. (2019). Reflections on gender
analyses of bibliographic corpora. Frontiers in Big Data, 2, 29.
Miller, M., & Vielfaure, N. (2022). OpenRefine: An approachable open tool to clean
research data. Bulletin-Association of Canadian Map Libraries and Archives
(ACMLA), 170.
Money, D. J. (2022). American Mining Engineers and the Global Copper Industry,
c.1880-1945. In Born with a Copper Spoon: A Global History of Copper, 1830-1980. UBC
Press.
Ochs, K. H. (1992). The Rise of American Mining Engineers: A Case Study of the Colorado
School of Mines. Technology and Culture, 33(2), 278.
https://doi.org/10.2307/3105859
Ooms, J. (2019). tesseract: Open Source OCR Engine. R package version 4.
https://CRAN.R-project.org/package=tesseract
Petrova-Antonova, D., & Tancheva, R. (2020). Data cleaning: A case study with OpenRefine
and Trifacta Wrangler. International Conference on the Quality of Information and
Communications Technology, 32-40.
Pretel, D., & Camprubí, L. (2018). Technology and Globalisation: Networks of Experts in
World History. Springer.
Ren, B. Y., Liang, C., & Lee, J. Z. (2020). Meritocracy and the Making of the Chinese
Academe, 1912-1952. The China Quarterly, 244, 942-968.
https://doi.org/10.1017/S0305741020001289
Ruggles, S., Fitch, C. A., & Roberts, E. (2018). Historical Census Record Linkage. Annual
Review of Sociology, 44(1), 19-37.
https://doi.org/10.1146/annurev-soc-073117-041447
Schneider, S. (n.d.). viafr: Interface to the “VIAF” ('Virtual International Authority File’).
API. R Package Version 0.2.0. https://CRAN.R-project.org/package=viafr
Shen, Z., Zhang, R., Dell, M., Lee, B. C. G., Carlson, J., & Li, W. (2021). Layoutparser: A
unified toolkit for deep learning based document image analysis. Document
Analysis and Recognition - ICDAR 2021: 16th International Conference, Lausanne,
Switzerland, September 5-10, 2021, Proceedings, Part I 16, 131-146.
Silge, J., & Robinson, D. (2016). tidytext: Text mining and analysis using tidy data principles
in R. Journal of Open Source Software, 1(3), 37.
Solares, I. G., & Beatty, E. (2025). Engineering History Project Dataset (Version v.1)
[Dataset]. CurateND. https://doi.org/10.7274/30108082
solaresig. (2024). Solaresig/Blueprint-for-Modernity [Computer software].
https://github.com/solaresig/Blueprint-for-Modernity (Original work published
2020)
South, A. (2011). rworldmap: A new R package for mapping global data. R Journal, 3(1).
Tennekes, M. (2018). tmap: Thematic Maps in R. Journal of Statistical Software, 84, 1-39.
Tkachenko, M., Malyuk, M., Holmanyuk, A., & Liubimov, N. (2022). Label Studio: Data
labeling software, 2020-2022. Open source software available from
https://github.com/heartexlabs/label-studio
Tuffnell, S. (2018). Engineering Gold Rushes: Engineers and the Mechanics of Global
Connectivity. In A Global History of Gold Rushes (pp. 229-251). University of
California Press.
Welcome to detectron2’s documentation! - detectron2 0.6 documentation. (n.d.).
Retrieved September 2, 2025, from https://detectron2.readthedocs.io/en/latest/
Wickham, H., & Wickham, M. H. (2017). Package ‘tidyr’: Easily Tidy Data with ‘spread()’
and ‘gather()’ Functions.
Yarberry, W. (2021). Stringr. In CRAN recipes: DPLYR, Stringr, Lubridate, and RegEx in R
(pp. 59-107). Springer.
Young, R., & Johnson, D. R. (2015). Handling missing values in longitudinal panel data with
multiple imputation. Journal of Marriage and Family, 77(1), 277-294.