[Preprint] The Engineering History Project Database: Creating and Linking Datasets from Structured, Semi-Structured, and Unstructured Historical Sources

Israel Solares, Edward Beatty

Publication Date

16-10-2025

License

This work is made available under a CC BY 4.0 license and should only be used in accordance with that

license.

Citation for this work (American Psychological Association 7th edition)

Solares, I., & Beatty, E. (2025). [Preprint] The Engineering History Project Database: Creating and Linking Datasets from Structured, Semi-Structured, and Unstructured Historical Sources (Version 1). University of Notre Dame. https://doi.org/10.7274/30376324.v1

This work was downloaded from CurateND, the University of Notre Dame's institutional repository.

For more information about this work, to report an issue, or to preserve and share your original work,

please contact the CurateND team for assistance at curate@nd.edu.

The Engineering History Project Database: Creating and Linking Datasets

from Structured, Semi-Structured, and Unstructured Historical Sources

[Preprint] October 2025

Israel G. Solares, UNAM (Mexico), israel.garcia@iimas.unam.mx

Edward Beatty, University of Notre Dame, ebeatty@nd.edu

Abstract

Using three different types of digitized historical sources – one containing structured

information, one semi structured, and one unstructured – we construct a relational

database that connects individuals, firms, and textual material related to individuals and

firms. The research project examines the emergence of professional engineering, 1870-

1930, and uses the global mining sector as a case study. This paper explains the methods

used to construct the initial three constituent datasets, including techniques to clean and

validate each. It then explains the methods used to transform and link those datasets,

creating a relational database that includes information on roughly 130,000 individuals,

over 50,000 firms, and almost 400,000 journal articles. We are able to trace individuals,

firms, and technologies over time and space and identify interconnected communities and

networks in a globalized setting. This is a preprint version.

KEYWORDS: History of technology; engineering; qualitative historical data; digital

methods; digital humanities; OCR; workflow; data linkage; global mining history.

Introduction

We live in a world fully engineered, the origins of which lie in the relatively recent past.

Engineering emerged and expanded as a modern profession between 1870 and the 1930s.

Although technical training schools have deeper historical roots in places like Germany

and France, the major institutions of the modern profession – university-based degree

programs, professional associations, trade journals, and employment in large

organizations, both public and private – proliferated globally in the late nineteenth

century and the early twentieth. By the 1930s, engineering had quickly become the

largest, the fastest growing, and arguably the most globally mobile profession in the

world. University degree programs and other technical training initiatives could be found

in countries on every continent and formed the basis for engineers’ claim to expertise.

Professional associations attracted thousands of members at local, regional, and national

levels and proliferated globally. Technical journals regularly published the results of

research, field experience, and commercial activity across engineering subfields,

distributed technical knowledge across national boundaries, and facilitated knowledge

exchange. And tens of thousands of individual engineers, most holding university

engineering degrees, sought work in private corporations and public agencies. Engineers

had become increasingly critical to twentieth century histories of technical knowledge,

economic growth, organizational planning, and technocratic governance (Beatty &

Solares, 2025).

This paper introduces three new linked datasets that enable us to see the highly

mobile and relational nature of the engineering profession during its formative first half

century. We detail the methods used to create each dataset, which are applicable to a

wide range of similar kinds of historical sources. The first dataset focuses on individuals,

built from the “structured” information published in iterative editions of university alumni

and professional association membership lists. It allows us to link individual engineers at

different times of their careers and build professional career biographies. The second

dataset focuses on firms, drawn from the “semi-structured” information in corporate

directories, each re-issued multiple times and covering large swaths of the global mining

and metallurgical sector, including financial, management, and technical information on

thousands of specific firms employing engineers. The third dataset focuses on the dense

prose content of trade journals. It incorporates and organizes “unstructured” text from

the Engineering and Mining Journal – with the largest international circulation at the time

– including each published issue over fifty years.

Linking records across these three very different types of historical sources enables

us to map and analyze the global mobility, relationships, and transnational networks

established by engineers in their schooling, employment, and publishing. Linkage of

records has been central in many studies on historical demography, allowing researchers

to create longitudinal panels to evaluate hypotheses on an individual level (Abramitzky et

al., 2020, 2024; Hedefalk et al., 2015; Ruggles et al., 2018). Yet, rarely have these datasets

included several kinds of sources, and linkages are typically made within the same

administrative producer (usually, for example, a population census). In contrast, we

highlight the utility of linking across the three datasets. This allows us to identify and

examine patterns that transcend national and organizational spaces, highlighting the

mobility of individuals and the connections between them as well as with the

organizations that produced and employed them. Although these datasets focus primarily

on the mining and metallurgical sector between 1870 and 1930, they can be expanded to

include other engineering fields and economic sectors as well. Our methods can be

applied to a wide range of similarly structured, semi-structured, and unstructured

historical sources. The datasets are open-access and available to other researchers via our

website and in long term repositories.1

This is not the first large-N, data-centric effort to examine aspects of modern

engineering history. In the mining sector, for example, Kathleen Ochs collected career

data on engineering graduates from the Colorado School of Mines, Duncan Money has

built a dataset using membership lists for the leading professional association in the

copper sector, and Marco Bertilorenzi has a dataset of graduates of France’s Saint Etienne

School of Mines. In addition, William Maloney and Felipe Valencia Caicedo have collated

data on twentieth century engineering numbers across countries in the Western

Hemisphere (Maloney & Valencia Caicedo, 2022; Money, 2022; Ochs, 1992). Other

scholars have noted the importance of international networks among professional mining

engineers (and other technical experts). See, for example, Stephen Tuffnell on networks in

the gold mining sector, and David Pretel and Lino Camprubi on global networks of experts

in the history of technology (Pretel & Camprubí, 2018; Tuffnell, 2018). More recently,

Bamboo Ren et al. have been able to use student and university employee records to

assess trends in Chinese academia in the first half of the twentieth century (Campbell &

Lee, 2020; Ren et al., 2020). The sources and database construction methods described in

this article represent a significant departure from these efforts in (a) the scale of the data

collection, (b) in the techniques used to collect, code, clean, and validate the data, and in

Our datasets have several unique features. First, we utilize sources produced by

different types of organizations, with different objectives, and exhibiting very different

internal characteristics. These include structured, semi-structured, and unstructured

qualitative historical evidence. Second, we present methods used to extract, code, clean,

and validate data across these different types of sources. Our methods can be utilized by

researchers working with similar types of historical evidence on other topics. Third and

most importantly, we present our methods for linking individuals across time, between

organizations, and across space. Part of the methods we describe below involves the

simplification and standardization of names of individuals and firms to maximize

opportunities for cross-source and cross-temporal linking, and algorithms of

disambiguation to avoid false links. These linkages reveal a degree of mobility and

relationship that is typically invisible (or at best anecdotal) in conventional studies based

on organization-centered archival data. Creating and then linking large datasets opens

substantial new opportunities for research, interpretation, and the identification of large-scale

temporal and spatial patterns via text mining, data visualization, mapping, and

network analysis.

1 The datasets can be located and downloaded through the Notre Dame digital repository, CurateND, and

cited as follows: Israel G. Solares and Edward Beatty (2025). Engineering History Project Dataset (Version

v.1) [Dataset]. CurateND. https://doi.org/10.7274/30108082.

Methods: Constructing the Research Data

Description and Features of the Historical Sources

Mining and metallurgical engineers constitute our case study. Mining engineering was the

first of the new fields of modern engineering to emerge as a distinct profession, following

military and civil engineering. In Europe, the Freiberg School, and in the Americas, the

Royal Mining College in Mexico City, both have eighteenth century roots. But technical

training programs everywhere exploded in the late nineteenth century, especially in the

United States, where over a dozen new university programs in mining and metallurgy

appeared, from MIT (1861) to the Colorado School of Mines (1874) (see appendix for full

list). At the same time, the American Institute of Mining Engineers (AIME) broke from the

long informal field of civil engineering in 1871. The Engineering and Mining Journal (1866),

published in New York, and the International Mining Manual (1886), published in London,

were already well-established authorities on mineral prices, company directories, and

technical information when the Wall Street Journal (1889) and the Financial Times (1888)

began their long print runs. Demand for gold, silver, and industrial metals drove a

globalization of mining expertise in the nineteenth and early twentieth centuries that

foreshadowed subsequent experiences in electrification, hydro-engineering, petroleum,

commercial agronomy, and development projects (for example). University degree

programs, professional associations, and journals of mining engineers were, arguably, the

first artifacts of globalized markets for knowledge and expertise in the twentieth century.

Globalizing engineering institutions along with the relationships, linkages, and

networks that engineers established were essential characteristics of what was, arguably,

the most impactful profession of the twentieth century. Engineers and engineering,

however, have been understudied and undervalued by historians. Historians’ ability to see

this global dynamic has long been impaired by a reliance on conventional sources,

produced by engineering institutions firmly rooted in local and national settings. Large-N

linked datasets can offer new perspectives and insights.

Details on our sources and annual coverage in each category can be found in the

on-line appendix to this paper. Each of the three types of historical sources we utilize has

distinctive content and structural characteristics and yields a distinct dataset. We

subsequently link the three into one relational database. A description of sources for each

dataset follows; Table 1 lists the field content of each and notes cell and record counts.

I. A “Structured” dataset that focuses on individuals, drawn from school alumni and

professional association membership lists. These sources are centered around individuals,

showing their names, affiliation and, in many cases, employment status and location. They

are published at regular intervals, incorporating new information on subsequent

employment and locations of graduates and members. They take the form of a single

column list, which implies a repetitive and structured presentation of data. Our list of

schools includes all major mining and metallurgical engineering degree-granting programs

in the United States, with nearly comprehensive coverage from 1870-1915, as well as

several major school programs in Mexico, England, Germany, and France. The association

data comes from the American Institute of Mining Engineers (AIME), covering 1873-1912 and for a

long time the biggest engineering association in the United States.

II. A “Semi structured” dataset that focuses on firms, drawn from corporate

directories. Published irregularly but repeatedly, these sources compile information on

firms in the mining and metallurgical sector. We rely heavily on two directories published

in New York and London, respectively. The information they contain takes the form of a

long single column list that includes some repetitive and structured data (such as a list of

top managers, often in multi-column form), semi-repetitive semi-structured data (such as

a narrative with the main characteristics of the firm) as well as nonrepetitive unstructured

data (such as notes on the particular characteristics of the firm). The directories include all

significant mining and metallurgical firms in the Anglo-American sphere, operating in

mining districts around the world.

III. An “Unstructured” dataset that focuses on the content of trade journals. This

source takes the form of text in independent informative pieces, long articles, images, and

advertisements in a multiple column format. The information, therefore, is highly

segmented, nonrepetitive and largely unstructured. The Engineering and Mining Journal

(covered 1871-1923) was the leading journal in mining and a global reference for

technology, prices, and corporate news in the sector; it covered industry developments

around the world.

The data available in the historical sources permit the extraction of the following

information:

Table 1: Potential Field Content of the Three Datasets

Datasets (columns): Structured Dataset; Semi structured Dataset; Unstructured Dataset

Fields: ID observation; Type of source; Volume; Source; Page; Page pdf; Original text; Simplified text; Type of text; Year reference; Year of Source; Year; Last name; Date; First name; Day; Middle Name; Middle name; Month; Simplified name; Author; Position; Title; Organization; Subtitle; Location; Longitude (attributed); Latitude (attributed); Nationality; Capital; Country; Minerals; Education; Output; Technology

                   Structured Dataset    Semi structured Dataset    Unstructured Dataset
Variables          17                    19                         14
Observations       316,672               611,291                    506,138
Unique entities    129,599 Persons       52,792 Firms               396,961 Articles

NOTES: “Organization” refers to either university or employer; “Education” refers to

school attended; “Year reference” refers to the year of the source. Observations

correspond to each mention of an individual in the sources (thus the same individual may

have multiple observations), to each entry for a particular firm, and (in the Unstructured

dataset) to recognizable text, linking date, title, and author with the text of an entry or

article. See below on the relational structure of these three datasets. Cell and record

counts are as of June 2025.

The digitized historical documents for each type of source were downloaded

through HathiTrust. The different structural characteristics of the sources required

different methods of information mining and data cleaning. The next three sections detail

the methods and workflows we utilized to extract, clean, and organize each type of

qualitative data, before turning to our validation protocol, linking methods, and results.

Structured Data

Structured data is common in published historical sources such as annual or occasional

lists of students, alumni, or association members. Figure 1 presents our workflow for

constructing a dataset from structured data. Extraction begins using OCR provided with

the original download of a digitized historical source from HathiTrust, or, if no OCR was

provided or it was of particularly poor quality, obtained using Adobe. The next step involves

segmentation of the text between observations, which in the individual-centered dataset means

differentiating between the units of information: individuals and the information

pertaining to them. We then classify and clean the information to be able to identify the

different categories present in each observation. Both segmentation and classification

used regular expressions in tidyverse (Ooms, 2019; Silge & Robinson, 2016; Wickham &

Wickham, 2017; Yarberry, 2021). A combination of OpenRefine (Miller & Vielfaure, 2022),

manual edits, and the Virtual International Authority File (VIAF) was

used to search for individuals and to choose the best name match if present in the

dataset. Geoparsing from locational data followed, and then a visual examination

in which common mistakes were identified before the process was repeated (Petrova-Antonova

& Tancheva, 2020). The structured datasets were then validated and compiled

into the database, before conducting a general validation of the Structured dataset.
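
The segmentation and classification steps can be illustrated with a minimal sketch. The project's own scripts are written in R with the tidyverse; the version below uses Python's standard `re` module purely for illustration. The sample text, the entry layout (surname in capitals, given names, two-digit class year, position, organization, location), and the names `segment` and `classify` are hypothetical assumptions, not the patterns used in the project.

```python
import re

# Hypothetical OCR text from a single-column alumni list (invented for illustration).
SAMPLE = """\
SMITH, John A., '92. Mining Engineer, Anaconda Copper Co., Butte, Mont.
McDONALD, William, '01. Superintendent, El Oro Mining Co.,
El Oro, Mexico.
"""

# Assumption: a new entry starts with a surname in capitals (optionally Mc/Mac) and a comma.
ENTRY_START = re.compile(r"^(?:Mc|Mac)?[A-Z]{2,}, ")

# Assumption: the fields of an entry always appear in this fixed order.
FIELDS = re.compile(
    r"^(?P<last_name>[^,]+),\s*"
    r"(?P<given_names>[^,]+),\s*"
    r"'(?P<class_year>\d{2})\.\s*"
    r"(?P<position>[^,]+),\s*"
    r"(?P<organization>[^,]+),\s*"
    r"(?P<location>.+?)\.?$"
)


def segment(text):
    """Split raw OCR text into one string per entry, joining wrapped lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)          # a new observation begins here
        else:
            entries[-1] += " " + line     # continuation of the previous entry
    return entries


def classify(entries):
    """Assign each entry a row ID and split it into named fields where possible."""
    rows = []
    for row_id, entry in enumerate(entries, start=1):
        match = FIELDS.match(entry)
        fields = match.groupdict() if match else {}
        # The original text is kept so every value can be traced back to its source row.
        rows.append({"row_id": row_id, "original_text": entry, **fields})
    return rows


rows = classify(segment(SAMPLE))
```

Entries that do not fit the assumed pattern keep only their row ID and original text, so they can be reviewed by hand rather than silently dropped.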

Figure 1: Sequence of Operations for Structured Data

Units as rows of text are the original form of extraction for the database. The

classification process then identifies the type of information within each unit through

regular expressions, but keeping a unique row ID (which allows us to trace back the data

mining process). Name parsing for individuals was elaborated through a simplification of

the name. Names appeared in the sources in several forms, from full names with titles to

abbreviations with last names and initials. Some of them had different orthographies

(such as Mc or Mac, or elimination of "von" or "de"), which further complicated the

parsing. We simplified the orthographies using regular expressions in stringr and reduced each name to last

name and initials to consolidate the variations.2
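
A minimal sketch of this kind of name simplification follows, written in Python rather than the R and stringr used in the project. The particle list, the title list, the choice to expand "Mc" to "Mac", and the function names are assumptions made for illustration; the project's actual rules are in its R scripts.

```python
import re
import unicodedata

# Assumed lists, chosen only for illustration.
PARTICLES = re.compile(r"\b(?:von|van|de|del|la)\b")
TITLES = re.compile(r"\b(?:prof|dr|mr|col|capt)\b\.?")


def fold(text):
    """Lower-case the text and strip diacritics (for example, 'í' becomes 'i')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def simplify_name(last_name, given_names=""):
    """Return a matching key made of the simplified last name plus initials."""
    last = PARTICLES.sub(" ", fold(last_name))   # drop particles such as "von" or "de"
    last = re.sub(r"[^a-z]", "", last)           # remove spaces, hyphens, apostrophes
    last = re.sub(r"^mc", "mac", last)           # treat Mc and Mac as the same prefix
    given = TITLES.sub(" ", fold(given_names))   # drop titles before taking initials
    initials = "".join(word[0] for word in re.findall(r"[a-z]+", given))
    return f"{last} {initials}".strip()
```

For example, `simplify_name("McDonald", "William")` and `simplify_name("MACDONALD", "W.")` both yield `macdonald w`. Reducing names this far increases recall across sources but also makes distinct people collide, which is why a later disambiguation step is needed.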

Like similar datasets focused on firms and individuals, all three of our datasets can

be described as unbalanced data panels with a long or stacked structure. The voluntary

reporting, the transformations of corporate names and government agencies, and the

continuous emergence of new actors (individuals, universities, and corporations, among

others) in the mining world make the construction of a comprehensive dataset

impossible.3 Nevertheless, we have no reason to suspect that missing observations are

nonrandom, as incentives to report did not change during the period. Figure 2 shows the

temporal density of individuals in the database, reflecting the rapid growth of engineering

between the 1870s and 1930s. Our Structured Dataset contains 315,141 observations of 9

variables, corresponding to 129,599 unique individuals. It is difficult to know the gender

composition from the dataset, but we used the genderdata package in R to make

historical predictions of sex assigned at birth (Blevins & Mullen, 2015; Mihaljević et al.,

2019). We predicted the sex assigned at birth of 70,834 individuals, almost 55% of the

sample; of these, about 2.6% might have been female. The geographical distribution of

individuals is heavily concentrated in the North Atlantic countries, although there is a

significant presence of work locations in mining regions across all continents (see Figure

3). We can identify 242,629 geolocations in the individual-centered dataset,

corresponding to 23,644 distinct locations for 98,501 different individuals.

2 R scripts and CSV files are available at (solaresig, 2020/2024).

3 See some of the implications of missing data in panel analysis with similar datasets (individuals that self-report

in waves and firms in the market) in (Baltagi & Chang, 1994; Baltagi & Song, 2006; Young & Johnson,

2015).

Figure 2: Total number of individuals in the Structured dataset

Figure 3: Geolocation of individuals in the Structured dataset

Semi structured Data

Firms in the mining sector constitute the focus of the second dataset, drawn from

repeated editions of corporate directories. Using these sources involves a combination of

structured and unstructured data. Our first experiment with managing this data was with

the Mexican Mining Manual, in which we used a greater amount of visual classification

and parsing. In contrast, extracting the information from the more substantial corporate

directories published in New York and London (MYB and AMM) between 1887 and 1923

used minimal visual classification. In these, the structured section of the data typically

included the company's name and list of directors, but the rest of the information was

largely unstructured and inconsistent across observations. These two sets of

information required the same steps as the structured data in the individual-centered dataset but

with a higher level of complexity to consistently establish the character and naming of

firms (Figure 4).

Figure 4: Sequence of Operations for Semi-Structured Data

After classifying and cleaning the structured section of the information, we searched for

names and locations already present in the dataset, and used regular expressions to

identify the names and locations of places, such as mines, mills, and headquarters, and

to recognize output types and capital for the firms. As firms’ self-reporting segments

this information, further searches following these methods can be made in the future. The

Named Entity Recognition for firms is more complex than the cleaning of individual names

in the structured dataset. Name parsing for organizations is more challenging because

organizations may change their name, consolidate or become absorbed by others, or

simply register as a subsidiary of another firm on occasion (Song et al., 2020). Naming

authorities contain a small amount of information for these entities, as thousands of firms

appeared and disappeared relatively quickly in the mining sector. Consequently, name

parsing involved a combination of cleaning through regular expressions in R using the

stringr package, searching the Virtual International Authority File (VIAF) database, and the

use of OpenRefine (Bianchini et al., 2021; Schneider, n.d.).4 The next step was a simple

geoparsing based on the locational data produced by the classification.
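
One way to picture the regular-expression part of firm-name cleaning is a normalized key that groups spelling variants of the same firm. The Python sketch below is loosely modeled on the fingerprint keying that OpenRefine offers for clustering; it is not the project's R code, and the abbreviation table, the stop-word list, and the function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Assumed abbreviation table and stop words; real directories would need a longer list.
ABBREVIATIONS = {
    "co": "company",
    "ltd": "limited",
    "corp": "corporation",
    "inc": "incorporated",
}
STOPWORDS = {"the", "and", "of"}


def firm_key(name):
    """Build an order-insensitive key from the distinct words of a firm name."""
    text = name.lower().replace("&", " and ")
    tokens = re.findall(r"[a-z0-9]+", text)             # drops punctuation
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    tokens = sorted({tok for tok in tokens if tok not in STOPWORDS})
    return " ".join(tokens)


def cluster(names):
    """Group raw firm names that share the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[firm_key(name)].append(name)
    return dict(groups)
```

A key of this kind only catches orthographic variation. It cannot detect the renamings, consolidations, and subsidiary registrations described above, which still require authority files or manual review.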

The Semi structured Dataset contains 611,291 observations of 12 variables,

corresponding to 52,792 unique organizations. Figure 5 shows the temporal density of

companies, and Figure 6 shows the 5,510 distinct locations present. Significant

clustering occurs in London, New York, and Boston, where the headquarters of many firms

are listed.

4 See the files corpis.R and combo.R in our on-line appendix for the specific criteria.

Figure 5. Total number of organizations in the Semi structured dataset.

Figure 6: Geolocation of organizations in the Semi structured dataset

Unstructured Data

Unstructured qualitative data in historical sources present a number of challenges, but

recent developments in pattern recognition have allowed researchers to identify entities

in semi structured and unstructured data, with the help of automatic methods using

machine learning (Lafia et al., 2023; Liu et al., 2022). Our work with the Engineering and

Mining Journal (EMJ) to create the Unstructured Dataset faced the following problems:

a. Character recognition. Complex layouts on the digitized page, composed of

multiple columns, headers, pictures and sections, often produce faulty and

incomplete character recognition. A picture might be interpreted as text or text as

an image, creating complex patterns in the resulting text layer (Blomqvist et al.,

2023).

b. Segmentation. Articles are pieces of information with internal coherence,

but the breaks between them do not follow repetitive patterns embedded in the

text. In other words, whereas the alumni lists, membership publications, and

corporate directories had a recognizable structure that separated the pieces of

information, the journal (like most newsprint formatting) does not provide one.

Section and subsection headings may be inconsistent, and single articles often

cross column or page boundaries. This implies that, even if the pieces of text are

recognizable by the OCR engine, it cannot easily make the proper distinctions

between the pieces of information – the component elements of the journal

(Likforman-Sulem et al., 2007).

c. Classification. While vital information, such as location, names, and

employment, had a clear and repetitive structure in the other two datasets, the

unstructured nature of the journal implies that extracting this information might

face even greater challenges regarding entity recognition and disambiguation. We

therefore do not extract such information in specific fields, and the information

has to be acquired through ad hoc searches within the content of the “Text” field

of the resulting dataset (Bravo Balado et al., 2015).

d. Cleaning. Because classification of information is complex, it is also very

difficult to identify and remove mistakes, particularly in longer pieces of text. Cleaning

can be done only in the structured parts of the article, such as author names,

when they are present (Christen, 2012).

Image segmentation with a computer vision method was used to overcome these

limitations, although some remain in the final form of the database. Figure 7 outlines the

sequence and logic of the workflow process.

Figure 7: Sequence of Operations for Unstructured Data

The first step of the process was labeling. We constructed two initial datasets for

training and testing, and these were labeled by a research assistant following the

examples of a first test and using the process proposed by Label Studio (Lin et al., 2014;

Shen et al., 2021; Tkachenko et al., 2022; Welcome to Detectron2’s Documentation! —

Detectron2 0.6 Documentation, n.d.). As the labeling was made on multiple computers,

the IDs of the COCO annotations are inconsistent, so consolidation required homogenizing

the labeling IDs. We also double-checked the labeling tasks, eliminating ambiguities in the

images labeled to feed into the training process. We included only five categories in the

labeling (page number, date, title, author, text), excluding page headers, images, captions

for images, and tables, as their information is often difficult to extract.
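
The consolidation of labeling IDs can be sketched as follows, assuming each export is a COCO-style dictionary with `images`, `annotations`, and `categories` lists. The exact label strings, the function name `merge_coco`, and the decision to renumber everything from zero are assumptions for illustration, not a description of the project's own script.

```python
# The five categories named in the text; the exact label strings are assumed.
CATEGORIES = ["page number", "date", "title", "author", "text"]


def merge_coco(exports):
    """Merge COCO-style exports whose category and image IDs are inconsistent."""
    category_ids = {name: idx for idx, name in enumerate(CATEGORIES)}
    merged = {
        "images": [],
        "annotations": [],
        "categories": [{"id": idx, "name": name} for name, idx in category_ids.items()],
    }
    for export in exports:
        # Category IDs are only meaningful within one export, so map them by name.
        local_names = {cat["id"]: cat["name"] for cat in export["categories"]}
        image_map = {}
        for image in export["images"]:
            new_id = len(merged["images"])
            image_map[image["id"]] = new_id
            merged["images"].append({**image, "id": new_id})
        for ann in export["annotations"]:
            name = local_names[ann["category_id"]]
            if name not in category_ids:
                continue  # skip labels outside the five retained categories
            merged["annotations"].append({
                **ann,
                "id": len(merged["annotations"]),
                "image_id": image_map[ann["image_id"]],
                "category_id": category_ids[name],
            })
    return merged
```

The sketch does not detect the same page being labeled on two machines; that case would need a check on file names before merging.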

After the labeling was complete, the dataset was randomly divided into training

and validation at a split ratio of 0.85.5 After a model of predictions was produced, weights

were used to predict the layout of the pages in all volumes of EMJ (1870-1930), totaling

84,287 pages. The prediction result was divided into pages and segments of the image, to

which the Tesseract OCR package was applied. In short, each piece of information on the

page has locational data (x and y coordinates) and associated text. The next step was to

assign a hierarchy based on the locational data on each page, to rebuild the pieces of

information that might have been separated (paragraphs of the same article in different

columns, or interrupted by an image, etc.). After this, a new text segmentation based on

regular expressions was made to minimize the errors in the articles. In this step, we added

the category “subtitle” to account for pieces of information that have internal coherence

between them, even if titles separate them. Because we were more concerned with false matches

(joining pieces of information with no relationship between them) than with missed ones, the rules

in this step were deliberately restrictive.

5 We used a 0.85 ratio as suggested for the training of the layout parser model by Shannon Shen (Layout-Parser/Examples/Customizing Layout Models with Label Studio Annotation/Customizing Layout Models with

Label Studio Annotation.Ipynb at Main · Layout-Parser/Layout-Parser · GitHub, n.d.).
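
The idea of rebuilding articles from locational data can be sketched in a few lines. The example below assumes a page with equal-width columns, assigns each detected block to a column by its horizontal center, reads columns left to right and top to bottom, and starts a new article at every title. The `Block` structure, the two-column default, and the function names are hypothetical; the project's actual hierarchy rules may differ.

```python
from dataclasses import dataclass


@dataclass
class Block:
    label: str    # predicted category, e.g. "title", "author", "text"
    x: float      # left edge of the bounding box
    y: float      # top edge of the bounding box
    width: float
    text: str     # OCR output for this segment


def reading_order(blocks, page_width, n_columns=2):
    """Sort blocks column by column, then top to bottom within each column."""
    column_width = page_width / n_columns

    def key(block):
        center = block.x + block.width / 2
        column = min(int(center // column_width), n_columns - 1)
        return (column, block.y)

    return sorted(blocks, key=key)


def assemble_articles(blocks, page_width, n_columns=2):
    """Group ordered blocks into articles; each title opens a new article."""
    articles = []
    for block in reading_order(blocks, page_width, n_columns):
        if block.label == "title":
            articles.append({"title": block.text, "author": "", "text": []})
            continue
        if not articles:
            # Text before any title is treated as a continuation from a previous page.
            articles.append({"title": "", "author": "", "text": []})
        if block.label == "author":
            articles[-1]["author"] = block.text
        elif block.label == "text":
            articles[-1]["text"].append(block.text)
    return articles
```

With this ordering, a paragraph at the top of the second column is attached to the article whose title sits in the first column, which is the case of an article crossing a column boundary.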

The Unstructured Dataset comprises 506,138 observations across 19 variables,

corresponding to 396,961 unique articles spanning over 40 years of the journal. Figure 8

shows the density of articles in the database. As can be seen, the curated nature of trade

news makes the accumulated information smoother than that for individuals and corporations,

but a similar temporal growth pattern is evident. We can identify and geolocate 246,495

pieces in the journal dataset, corresponding to 2,887 unique locations (Figure 9).6 The

geographic distribution of places mentioned in articles reinforces the U.S. and European

orientation of the journal and the industry, but also illustrates its globally dispersed

interests.

Figure 8: Total number of articles in the Unstructured dataset.

Figure 9: Geolocation of articles in the Unstructured dataset

6 Using the Google API for geolocation, and the ggmap and countrycode packages, both in R, for

mapping; see (Arel-Bundock et al., 2018; Kahle & Wickham, 2013; South, 2011).

Validation

One of the main challenges in data linkage is validating data quality in the linked datasets,

which can be degraded by problems in record transcription, digitization, and linking algorithms

(Antonie et al., 2020; Bailey et al., 2020, 2023; Wisselgren et al., 2014). One of the

advantages of linkage within a professional community, over a relatively short period of

time, is that the risks of false positives in linked records are greatly reduced, compared to

the linkage across census years, for example. Nevertheless, validation of the data was an

integral part of producing each of the three datasets.

For the information in the Structured Dataset, there are four levels of possible

errors: errors in the extraction process through the OCR; errors in the cleaning process;

errors in classification between different kinds of information; and errors in parsing, both

for individuals and organizations. We conducted three assessments of each type of

mistake present in the database, measuring the most common errors and the overall

accuracy of the methodology. We separated the text-centered dataset into five categories

according to the quantity of data. Then, a randomized representative sample (95%

confidence level) of the rows of the database was evaluated by a researcher, who compared them

with the original sources and made informed decisions about the mistakes in the

process. The results are in Table 2.
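The source does not state the margin of error or the assumed proportion behind its representative samples. As a minimal sketch, assuming the common defaults of a ±5% margin and p = 0.5 (both assumptions, not taken from the paper), Cochran's formula at a 95% confidence level gives a sample of about 384 rows, which is close to the sample sizes reported in the validation tables. The function names and the example population are hypothetical.

```python
import math


def sample_size(z: float = 1.96, p: float = 0.5, margin: float = 0.05) -> float:
    """Cochran's formula for an effectively infinite population.

    z: critical value (1.96 for a 95% confidence level)
    p: assumed proportion (0.5 is the most conservative choice)
    margin: tolerated margin of error (assumed here, not stated in the source)
    """
    return (z ** 2) * p * (1 - p) / (margin ** 2)


def sample_size_finite(population: int, z: float = 1.96,
                       p: float = 0.5, margin: float = 0.05) -> int:
    """Apply the finite-population correction and round up to whole rows."""
    n0 = sample_size(z, p, margin)
    return math.ceil(n0 / (1 + (n0 - 1) / population))


n0 = sample_size()
print(f"Uncorrected sample size: {n0:.2f}")
# Hypothetical population of 1,000 rows, only to show the correction at work.
print(f"Corrected for 1,000 rows: {sample_size_finite(1000)}")
```

For large populations the correction is negligible, so the required sample stays near 384 regardless of how many rows the full dataset contains.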

Table 2. Validation of Representative Samples of the Structured Dataset

ALUMNI

AIME

OVERALL

384

382

766

CAcc

97.70%

99.00%

99.30%

WAcc

Company/Organization

72.60%

89.70%

81.20%

Year Regist

80.00%

Location

92.70%

46.20%

Lastname

94.60%

86.60%

90.60%

Firstname

93.90%

Middlename

67.00%

Position

76.80%

95.60%

86.20%

Nationality

99.50%

Education

100.00%

Overall Accuracy

80.80%

94.00%

87.40%

Notes: CAcc (Character Accuracy Rate, or 100% minus CER), WAcc (Word Accuracy Rate)
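The notes define CAcc as 100% minus the character error rate (CER). The paper does not spell out how the rates were computed, so the following is only a sketch of the standard definitions: CER and the word error rate are the Levenshtein edit distance between the OCR output and the reference transcription, divided by the reference length in characters or words. The company name in the example is made up.

```python
def levenshtein(ref, hyp) -> int:
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """CAcc = 1 - CER, with CER = character edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 1 - levenshtein(reference, ocr_output) / len(reference)


def word_accuracy(reference: str, ocr_output: str) -> float:
    """WAcc = 1 - WER, computed over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 1 - levenshtein(ref_words, ocr_output.split()) / len(ref_words)


# Hypothetical example: the OCR reads "m" as "rn", a typical confusion.
reference = "Acme Copper Mining Co."
ocr_output = "Acrne Copper Mining Co."
print(f"CAcc = {char_accuracy(reference, ocr_output):.2%}")
print(f"WAcc = {word_accuracy(reference, ocr_output):.2%}")
```

The example shows why word-level accuracy can sit well below character-level accuracy: a single damaged character invalidates the whole word it belongs to.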

Although the overall accuracy was 87.39%, the relatively low scores for company name recognition, position, and middle name motivated a general revision of the cleaning, classifying, and parsing steps for the complete dataset.

Verification of information in the Semi-Structured Dataset began by integrating the information for another round of cleaning and testing, looking for the same kinds of mistakes as above in the extracted information. We applied the analysis to the consolidated data; the results are in Table 3.

Table 3. Validation of Representative Samples of the Semi-Structured Dataset

                          MYB      AMM      MMM      OVERALL
Sample size (n)           409      384      368      1161
CAcc                      98.6%    99.5%    99.8%    99.3%
WAcc
  Company/Organization    43.8%    73.8%    92.7%    69.2%
  Capital                 99.3%    98.8%    96.8%    98.3%
  Year Regist             99.3%    98.3%    99.8%    99.1%
  Location                75.8%    94.4%    89.0%    86.1%
  Last name               88.8%    83.1%    74.1%    82.2%
  First name              94.1%    97.6%    94.6%    95.4%
  Middle name             91.7%    87.8%    76.5%    85.6%
  Role                    96.6%    98.3%    96.3%    97.1%
  Overall                 86.2%    91.5%    90.0%    89.1%

Notes: CAcc (Character Accuracy Rate), WAcc (Word Accuracy Rate). MYB = Mining Year Book; AMM = American Mining Manual; MMM = Mexican Mining Manual.

The overall accuracy was 89.1%. However, as with the structured data, the discrepancies between the results and the original sources (especially in company names) motivated a general revision of the cleaning, classifying, and parsing steps for the complete dataset. Again, a randomized representative sample was drawn at a 95% confidence level. The accuracy results for the new datasets, for both individuals and firms, are presented in Table 4.7

Table 4. Second validation of representative sample (WAcc)

Item               Individuals Dataset    Firms Dataset    Average
Last name          98.02%                 86.36%           94.01%
Middle name        98.02%                 93.18%           96.35%
First name         97.62%                 93.18%           96.09%
Simplified name    96.03%                 81.06%           90.89%
Education          100.00%                99.24%           99.74%
Organization       100.00%                84.09%           94.53%
Location           91.67%                 98.48%           94.01%
Average            97.34%                 90.80%           95.09%

7 See the files in the replicability package at (solaresig, 2020/2024).

We tested the matching process by measuring the likelihood of spurious matches between individuals in the Structured dataset and the Semi-Structured dataset. A spurious match could be caused by: convergent errors in the OCR, in which the errors happen to match a third, nonexistent entity; convergent errors in the cleaning and classification process; or coincidence in the last name, the first letter of the first name, and the first letter of the middle name. We generated a randomized representative sample, at a 95% confidence level, of records matched through a simplified name. A researcher then classified each match as a true or false positive by manually inspecting the sources and the contextual information. Across all the above causes, false positives represented just 1.83% of the test sample (n=382). After this process, the 2,000 most prominent organizations in the database were visually examined and manually corrected for the errors described above. Most of the corrections involved not a misidentified company but a spelling mistake carried over from the OCR phase. No further accuracy assessment has been made on the manually corrected datasets.
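The third cause of spurious matches, coincidence in the last name and the two initials, can be illustrated with a minimal sketch. It assumes the simplified name is a key built from the last name, the first letter of the first name, and the first letter of the middle name; the key format, the function names, and the records are hypothetical and are not taken from the replicability package.

```python
from collections import defaultdict


def simplified_name(last: str, first: str = "", middle: str = "") -> str:
    """Build a matching key: LASTNAME|first initial|middle initial."""
    def initial(part: str) -> str:
        part = part.strip()
        return part[0].upper() if part else ""
    return "|".join([last.strip().upper(), initial(first), initial(middle)])


def candidate_matches(structured, semi_structured):
    """Pair records from the two datasets that share a simplified name."""
    index = defaultdict(list)
    for record in semi_structured:
        index[simplified_name(**record)].append(record)
    return [(a, b)
            for a in structured
            for b in index.get(simplified_name(**a), [])]


# Made-up records: two different people collapse to the same key.
structured = [{"last": "Smith", "first": "Robert", "middle": "Allen"}]
semi_structured = [
    {"last": "SMITH", "first": "Richard", "middle": "A."},
    {"last": "Smith", "first": "Robert", "middle": ""},
]

for a, b in candidate_matches(structured, semi_structured):
    print(simplified_name(**a), "->", b)
```

Here Robert Allen Smith is paired with Richard A. Smith because both reduce to the same key, which is exactly the kind of candidate a reviewer has to reject by checking the sources and the contextual information.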

Validation of the Unstructured data required new levels of evaluation, as the possible errors differ: errors in image segmentation; misclassification of text into articles; misclassification of labels on the page; OCR errors; and errors in the reference system. As Table 5 reports, accuracy was highest in detecting the boundaries of text, followed by date and title at much lower levels. This is an expected result, as the larger objects are text and the smaller objects are date and author. For the object detection model, the usual validity metric is Average Precision (AP). The metric summarizes the area under the precision-recall curve; it can also be defined as the weighted mean of the precision values along that curve.8 The accuracy metrics greatly improved after applying regular expressions. The classification and cleaning process focused on the structured data provided after the segmentation: specific dates, correcting the page references, and cleaning the authors' names. We conducted NER based on the previous two datasets, extracting locations and corporate names from the unstructured text. Finally, a human validated a representative sample of the information and the comprehension of the unstructured sections to check the consistency of segmenting paragraphs into articles and news reports (coherent pieces of information), along with CAcc and WAcc measures. Table 6 shows the results of this human validation.

8 Precision is the ratio of True Positives (TP) to all detected positives (true positives and false positives), while recall is the ratio of TP to all actual positives (TP and False Negatives). Usually there is an inverse relationship between Precision and Recall; the precision-recall curve traces that relationship, and the AP metric summarizes that curve.
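The precision, recall, and AP definitions can be made concrete with a short sketch. It assumes each detection has already been labelled as a true or false positive (in object detection this is usually decided by an overlap threshold against the annotated boxes) and computes the uninterpolated AP as the sum of precision values weighted by the increase in recall. Benchmark implementations often use interpolated variants, so this is an illustration of the definition rather than the evaluation code used for Table 5; the scores are invented.

```python
def precision_recall_points(detections, num_ground_truth):
    """Return (recall, precision) after each detection, ranked by confidence.

    detections: list of (confidence, is_true_positive) pairs
    num_ground_truth: number of annotated objects (TP + FN)
    """
    tp = fp = 0
    points = []
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_ground_truth, tp / (tp + fp)))
    return points


def average_precision(detections, num_ground_truth):
    """Uninterpolated AP: precision weighted by each step's gain in recall."""
    ap = 0.0
    previous_recall = 0.0
    for recall, precision in precision_recall_points(detections, num_ground_truth):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Invented detections for one class, with four annotated objects on the page.
detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
for recall, precision in precision_recall_points(detections, num_ground_truth=4):
    print(f"recall={recall:.2f} precision={precision:.2f}")
print(f"AP = {average_precision(detections, num_ground_truth=4):.4f}")
```

False positives lower the precision at every later rank without adding recall, and objects that are never detected cap the recall below 1, so both kinds of error reduce AP.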

Table 5. Results of the Training
