1 Introduction


Welcome to the notebook Data cleansing: maestros.xlsx!

This notebook provides an example of how data should be cleansed using R before being published on the IHR Data Base.

Data cleansing is a tedious task that involves many trial-and-error iterations. To ease the detection and correction of mistakes, scripts are often broken down into smaller chunks with frequent object reassignment. This is especially helpful because it avoids having to rerun long pieces of code.

The criteria followed throughout the notebook are gathered in the Metodologia IHR adaptació conjunts dades, as of the creation date of this document. Nevertheless, each data set should be cleansed separately, according to its own variables and context.

1.1 Description

The data set “Maestros” contains 565219 observations and 6 variables, structured as follows:

  • APELLIDOS: string with the first and second family name.
  • NOMBRE: string with the personal name.
  • LEGAJO: string with the folder number.
  • Nº EXPEDIENTE: string with the record number.
  • TIPO EXPEDIENTE: string with the kind of record.
  • ESPECIALIDAD: string with the job specialization.

1.2 Working environment

For the purpose of this project, the following libraries are required:

  • “readxl” & “writexl”: to import and export excel files, in this case ‘maestros.xlsx’ and its cleansed subsets.
  • “tidyverse”: to manipulate data.
  • “stringi”: to transform strings.
  • “stringdist”: to operate according to string distances.
# Load of the required packages.
library(readxl)
library(writexl)
library(tidyverse)
library(stringi)
library(stringdist)

# Importation of the data.
data <- read_excel('maestros.xlsx')

# Visualization of the data (from now on only the head of the data).
head(data)
## # A tibble: 6 x 6
##   APELLIDOS    NOMBRE      LEGAJO `Nº EXPEDIENTE` `TIPO EXPEDIENTE` ESPECIALIDAD
##   <chr>        <chr>       <chr>  <chr>           <chr>             <chr>       
## 1 Porras Caba… Mª del Roc… 18746  12              Personal Docente  Maestro/a   
## 2 Dopacio Gra… Mª Argenti… 15798  143             Depuración Maest… Maestro/a   
## 3 Montoro Cas… José Luis   20107  105             ?                 Aparejador  
## 4 Ruiz Catage… María Susa… 21026  82              Personal Docente  Maestro/a   
## 5 Pérez García Mª de los … 7934   21              Personal Docente  Maestro/a   
## 6 A. Rios      Mª de los … 118    63              Depuración: Maes… <NA>
# Visualization of the structure of the data.
str(data)
## tibble [565,219 × 6] (S3: tbl_df/tbl/data.frame)
##  $ APELLIDOS      : chr [1:565219] "Porras Cabañero" "Dopacio Grana" "Montoro Castilla" "Ruiz Catagena" ...
##  $ NOMBRE         : chr [1:565219] "Mª del Rocío" "Mª Argentina Concepción" "José Luis" "María Susana" ...
##  $ LEGAJO         : chr [1:565219] "18746" "15798" "20107" "21026" ...
##  $ Nº EXPEDIENTE  : chr [1:565219] "12" "143" "105" "82" ...
##  $ TIPO EXPEDIENTE: chr [1:565219] "Personal Docente" "Depuración Maestros" "?" "Personal Docente" ...
##  $ ESPECIALIDAD   : chr [1:565219] "Maestro/a" "Maestro/a" "Aparejador" "Maestro/a" ...

To ease the workflow, each column’s name is replaced by a lower-case three-letter abbreviation, usually its first three letters, as in: APELLIDOS -> ape.

# Simplification of the names of the columns.
colnames(data) <- c("ape", "nom", "leg", "num", "tip", "esp")
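
As a rough illustration, the abbreviation rule can be approximated in code. The sketch below is only an assumption about how the names could be derived, not the method used in this notebook; `original_names` and `short_names` are hypothetical objects. Note that one column does not follow the three-letter rule and is still set by hand.

```r
# Hypothetical copy of the original column names ("\u00ba" is the "º" sign).
original_names <- c("APELLIDOS", "NOMBRE", "LEGAJO", "N\u00ba EXPEDIENTE",
                    "TIPO EXPEDIENTE", "ESPECIALIDAD")

# Lower-case each name and keep its first three characters.
short_names <- tolower(substr(original_names, 1, 3))

# The record-number column is the exception: it is abbreviated as "num".
short_names[4] <- "num"

print(short_names)
```

Typing the six names by hand, as done above, remains the simplest option for a data set this narrow.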

1.3 Nomenclature

Imported data are named fully subjectively, as in data = data set “maestros.xlsx”.

All other objects are named according to the following criteria and order:

  1. Type of object: v = vector, m = matrix, df = data frame, etc.
  2. Variable or column’s name: ape, nom, leg, etc. although longer names might also be used.
  3. Transformation’s…
    • number (only for transformation vectors): 00, 01, 05, etc.
    • information, if any: dig.row = rows with digits.

Examples:

  • v.ape00 = variable “ape”’s vector without transformations (exact copy of the variable)
  • v.ape01 = variable “ape”’s vector after the first transformation
  • v.nom.mas.row = variable “nom”’s vector with the rows containing masculine ordinal indicators
  • m.nom.fem = matrix with variable “nom”’s rows that have feminine ordinal indicators and the content of these cells

To avoid any ambiguity, all column names (including those belonging to different data frames) should differ from each other. This is feasible because of the small number of variables involved, and it makes object naming much easier.


2 Variables


Each variable requires a different cleansing process, so they should first be copied into string vectors and processed separately and afterwards reassembled into one clean data frame.
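
The copy, clean and reassemble cycle can be sketched on a toy example. Everything below is hypothetical (the `df.toy` data frame, its values and the single cleaning step applied to each column); it only illustrates the overall shape of the process.

```r
# Toy data frame standing in for the imported data (hypothetical values).
df.toy <- data.frame(ape = c(" Rios ", "Perez  Garcia"),
                     nom = c("Juan3", "Ana"),
                     stringsAsFactors = FALSE)

# 1. Copy each column into its own character vector.
v.toy.ape00 <- df.toy$ape
v.toy.nom00 <- df.toy$nom

# 2. Clean each vector separately (one illustrative step per variable).
v.toy.ape01 <- trimws(gsub(" {2,}", " ", v.toy.ape00))
v.toy.nom01 <- gsub("\\d+", "", v.toy.nom00)

# 3. Reassemble the cleaned vectors into one data frame.
df.toy.clean <- data.frame(ape = v.toy.ape01, nom = v.toy.nom01,
                           stringsAsFactors = FALSE)
print(df.toy.clean)
```

Because every vector keeps the original row order, the reassembled data frame lines up with the source rows without any joining step.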

2.1 Apellidos

Spanish family names should contain neither digits nor punctuation marks other than apostrophes, hyphens and dots. Extra characters and spaces, such as non-printable characters or padding spaces, should be deleted, and the affected names fixed if necessary.

# Creation of a vector containing all the strings in column "ape".
v.ape00 <- data$ape

2.1.1 Digits

All digits should be removed. However, doing so might introduce new mistakes such as broken or incomplete words.

# Visualization of all the rows containing DIGits.
v.ape.dig.row <- grep("\\d", v.ape00)
v.ape.dig.content <- v.ape00[v.ape.dig.row]
m.ape.dig <- cbind(v.ape.dig.row, v.ape.dig.content)
head(m.ape.dig)
##      v.ape.dig.row v.ape.dig.content         
## [1,] "87254"       "Sanchez de la Pe4ña"     
## [2,] "122114"      "Martínez Martínez\r\r\n1"
## [3,] "122651"      "Uri8be y Salmeron"       
## [4,] "152574"      "Muñoz G9onzález"         
## [5,] "159306"      "0lea Pérez"              
## [6,] "181956"      "Pimentel Vázq2uez"
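
The three-step inspection used above (grep, subset, cbind) recurs throughout the notebook. As a possible convenience, it could be wrapped in a small helper. `inspect_matches` is a hypothetical name, not a function from any package or from this notebook, and the toy vector is invented for the example.

```r
# Hypothetical helper: returns a character matrix with the index and the
# content of every element of `x` that matches `pattern`.
# Extra arguments (e.g. perl = TRUE) are passed on to grep().
inspect_matches <- function(pattern, x, ...) {
  idx <- grep(pattern, x, ...)
  cbind(row = idx, content = x[idx])
}

v.toy <- c("Lopez", "Mu9oz", "Perez", "0lea")
m.toy.dig <- inspect_matches("\\d", v.toy)
print(m.toy.dig)
```

The notebook keeps the three explicit objects instead, which leaves the row indices available for later particular corrections.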

The deletion of all digits is a great solution for all the cases except for v.ape00[159306] = "0lea Pérez", where the “0” should be transformed to “O”.

Although this transformation can be subject to debate, “Olea” is a common Spanish family name and, since the key “0” is very close to the key “O”, they could easily have been mistyped. From now on and for practical reasons, there won’t be any other explanations on how particular cases are corrected.

# Correction of all the mistakes requiring a particular solution.
v.ape00[159306] = "Olea Pérez"

# Deletion of all the remaining digits.
v.ape01 <- gsub("\\d+", "", v.ape00)

All digits have been removed or fixed, except for one case that still needs fixing:
v.ape01[122114] = "Martínez Martínez\r\r\n", which should have the control characters removed.
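
The order of the two steps matters: particular cases must be corrected before the general deletion. A minimal sketch with invented values shows the difference.

```r
v.toy00 <- c("Pe4rez", "0lea Perez", "Uri8be")

# Deleting digits right away truncates "0lea" to "lea".
v.toy.naive <- gsub("\\d+", "", v.toy00)

# Fixing the particular case first preserves the intended name.
v.toy00[2] <- "Olea Perez"
v.toy01 <- gsub("\\d+", "", v.toy00)

print(v.toy.naive)
print(v.toy01)
```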

2.1.2 Control characters

All control characters should be removed. As before, doing so might introduce new mistakes such as broken or incomplete words.

# Visualization of all the rows containing CONtrol characters.
v.ape.con.row <- grep("[[:cntrl:]]", v.ape01)
v.ape.con.content <- v.ape01[v.ape.con.row]
m.ape.con <- cbind(v.ape.con.row, v.ape.con.content)
head(m.ape.con)
##      v.ape.con.row
## [1,] "2836"       
## [2,] "4029"       
## [3,] "4159"       
## [4,] "13673"      
## [5,] "19151"      
## [6,] "19152"      
##      v.ape.con.content                                                 
## [1,] "Adán y Martínez\r\r\nAdán y Martínez"                            
## [2,] "Altamirano y Martín-Montijano\r\r\nAltamirano y Martín-Montijano"
## [3,] "Andres  Jimenez\r\r\nimenez"                                     
## [4,] "Juanals Sitjas\r\r\nJuanals Sitjas"                              
## [5,] "García Torres\r\r\nGarcia Torres"                                
## [6,] "García Torres\r\r\nGarcía Torres\r\r\nGarcía Torres"

All observations can be fixed by deleting the control characters and the characters that follow them, except for some cases that require a particular solution.

# Correction of all the mistakes requiring a particular solution.
v.ape01[53297] = "Ricou Muñoz"
v.ape01[153256] = "Muñoz y Muñoz"
v.ape01[278881] = "Campoy Martínez"
v.ape01[323379] = "Yániz Martínez"
v.ape01[331439] = "León Montoya"
v.ape01[469864] = "Jimenez Urbano"
v.ape01[471983] = "Jiménez García"  
v.ape01[514837] = "Enríquez y Noure"
v.ape01[518783] = "Martínez Pérez"  
v.ape01[530785] = "Mondragón Elorza"  
v.ape01[541141] = "Arnaldes Millán"

# Deletion of all the remaining control characters and their following characters.
v.ape02 <- gsub("[[:cntrl:]]+.*", "", v.ape01)

All control characters have been properly removed, but there are still many cases requiring a space where the characters were.
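
To make the effect of the pattern concrete, here is a small sketch on invented values. It keeps only the text before the first run of control characters, which is why entries whose second part carries information need a particular correction beforehand.

```r
v.toy01 <- c("Garcia Torres\r\r\nGarcia Torres",
             "Andres Jimenez\r\r\nimenez",
             "Lopez")

# Drop the first run of control characters and everything after it.
v.toy02 <- gsub("[[:cntrl:]]+.*", "", v.toy01)
print(v.toy02)
```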

2.1.3 Ordinal indicators

All ordinal indicators should be replaced by their intended written meaning, as in “Mª” -> “María” or “Antº” -> “Antonio”, etc. However, sometimes the intention might be ambiguous and can be replaced by a dot, like “Fº” = “Fernando/Francisco/…?” -> “F.”.

# Visualization of all the rows containing MASculine ordinal indicators.
v.ape.mas.row <- grep("º", v.ape02)
v.ape.mas.content <- v.ape02[v.ape.mas.row]
m.ape.mas <- cbind(v.ape.mas.row, v.ape.mas.content)
head(m.ape.mas)
##      v.ape.mas.row v.ape.mas.content

There are no masculine ordinal indicators.

# Visualization of all the rows containing FEMinine ordinal indicators.
v.ape.fem.row <- grep("ª", v.ape02)
v.ape.fem.content <- v.ape02[v.ape.fem.row]
m.ape.fem <- cbind(v.ape.fem.row, v.ape.fem.content)
head(m.ape.fem)
##      v.ape.fem.row v.ape.fem.content           
## [1,] "224145"      "Arraiz Santa Mª"           
## [2,] "240768"      "Arroyo PérezMª del Rosario"
## [3,] "492341"      "Mª Josefa del"
# Replacement of all the "ª"s by "aría".
v.ape03 <- gsub("ª+", "aría", v.ape02)

All cases have been properly replaced by “aría”, but there is still one mistake that needs fixing:
v.ape03[240768] = "Arroyo PérezMaría del Rosario" needs a space between “PérezMaría”.
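
The replacement can be tried on a toy vector. The values are invented, and the non-ASCII characters are written as Unicode escapes only so that the sketch does not depend on the file encoding.

```r
# "\u00aa" is the feminine ordinal indicator; "\u00ed" is the accented "i".
v.toy02 <- c("Arraiz Santa M\u00aa", "Lopez")

# Expand the indicator into the ending "aría", so that "Mª" reads "María".
v.toy03 <- gsub("\u00aa+", "ar\u00eda", v.toy02)
print(v.toy03)
```

This only works because every indicator found in this column follows an "M"; an indicator after any other letter would need its own replacement.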

2.1.4 Punctuation marks

2.1.4.1 Apostrophes

All apostrophes should be surrounded by alphabetic characters.

# Visualization of all the rows containing APOstrophes.
v.ape.apo.row <- grep("'", v.ape03)
v.ape.apo.content <- v.ape03[v.ape.apo.row]
m.ape.apo <- cbind(v.ape.apo.row, v.ape.apo.content)
head(m.ape.apo)
##      v.ape.apo.row v.ape.apo.content     
## [1,] "8154"        "D'Angelo Muñoz"      
## [2,] "55921"       "Roca D'Ocón"         
## [3,] "101459"      "Malcampo O' Farrell" 
## [4,] "159422"      "O'Dogherty López"    
## [5,] "159423"      "O'Dogherty Sánchez"  
## [6,] "159424"      "O'Farrell Montesinos"

In some cases, the word to the right of an apostrophe is separated from it by one or more spaces. Such words should be attached to the apostrophe.

# Attachment to the apostrophe of words away from it.
v.ape04 <- gsub("(')( +)", "\\1", v.ape03)

All apostrophes are now surrounded by alphabetic characters. However, the word after the apostrophe in v.ape04[c(378229, 528897, 528898)] must still be upper-cased.
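
A short sketch on invented values shows what the substitution does: the apostrophe is captured as the first group and kept, while the run of spaces after it is dropped.

```r
v.toy03 <- c("Malcampo O' Farrell", "O'Dogherty Lopez", "Roca D'  Ocon")

# Keep the apostrophe (group 1) and delete the spaces that follow it (group 2).
v.toy04 <- gsub("(')( +)", "\\1", v.toy03)
print(v.toy04)
```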

2.1.4.2 Dots

All dots should be preceded by an isolated upper-case letter and followed by a space.

# Visualization of all the rows containing DOTs.
v.ape.dot.row <- grep("[.]", v.ape04)
v.ape.dot.content <- v.ape04[v.ape.dot.row]
m.ape.dot <- cbind(v.ape.dot.row, v.ape.dot.content)
head(m.ape.dot)
##      v.ape.dot.row v.ape.dot.content         
## [1,] "6"           "A. Rios"                 
## [2,] "4525"        "Aranega J.  Del Castillo"
## [3,] "7543"        "Casals y C. Jovellanos"  
## [4,] "13629"       "Heredia R. de Castaño"   
## [5,] "13690"       "V.Ventura Astals"        
## [6,] "39122"       "Ibañez D. Del Cotero"

All observations are either correct or can be fixed by substituting the dot with a space, except for four cases that require a special solution.

# Correction of all the mistakes requiring a particular solution.
v.ape04[13690] = "V. Ventura Astals"  
v.ape04[187413] = "De la O Fernandez"
v.ape04[203995] = "S. Elias"
v.ape04[278753] = "Bohigas y Domingo"  

# Substitution of all the dots between words with a space.
v.ape05 <- gsub("([a-z])([.])([A-Z])", "\\1 \\3", v.ape04)
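
The following sketch, on invented values, illustrates which dots the pattern touches. It is written with a plain space in the replacement string. Only a dot between a lower-case and an upper-case letter is replaced, so initials are left alone, and a case such as "V.Ventura" is not matched at all, which is why it needs a particular correction.

```r
v.toy04 <- c("Perez.Garcia", "A. Rios", "V.Ventura Astals")

# Replace a dot squeezed between a lower-case and an upper-case letter.
v.toy05 <- gsub("([a-z])([.])([A-Z])", "\\1 \\3", v.toy04)
print(v.toy05)
```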

2.1.4.3 Hyphens

All hyphens should be surrounded by alphabetic characters.

# Visualization of all the rows containing HYPhens.
v.ape.hyp.row <- grep("-", v.ape05)
v.ape.hyp.content <- v.ape05[v.ape.hyp.row]
m.ape.hyp <- cbind(v.ape.hyp.row, v.ape.hyp.content)
head(m.ape.hyp)
##      v.ape.hyp.row v.ape.hyp.content             
## [1,] "162"         "Abad Díaz-Parreño"           
## [2,] "595"         "Abad-Conde Sevilla"          
## [3,] "1014"        "Abascal Martín-Artajo"       
## [4,] "1097"        "Abdel-Lah Abdel-Lah"         
## [5,] "1098"        "Abdel-Lah Meki"              
## [6,] "1572"        "Abengózar y Fernández-Arroyo"

Most cases are correct, but some hyphens are next to at least one space or appear to be mistakes.

# Correction of all the mistakes requiring a particular solution.
v.ape05[157784] = "Núñez Puga"
v.ape05[219077] = "Borbonoes Llopis"
v.ape05[238966] = "Aldás Oliver"    
v.ape05[249612] = "Sels Guibas"                            
v.ape05[249613] = "Senderos y Villanueva" 
v.ape05[465921] = "Comella"

# Deletion of all the spaces to the right of a hyphen.
v.ape06 <- gsub("(-)( +)", "\\1", v.ape05)

# Deletion of all the spaces to the left of a hyphen.
v.ape07 <- gsub("( +)(-)", "\\2", v.ape06)

All hyphens are now surrounded by alphabetic characters. However, there are still family names with incorrect casing.
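
The two passes can be checked on a toy vector with invented values. The single-pass variant at the end is an alternative shown for comparison, not the code used in this notebook.

```r
v.toy05 <- c("Abad - Conde", "Diaz -Parreno", "Martin- Artajo", "Abdel-Lah")

# Two passes, as in the text: spaces to the right first, then to the left.
v.toy06 <- gsub("(-)( +)", "\\1", v.toy05)
v.toy07 <- gsub("( +)(-)", "\\2", v.toy06)

# Alternative single pass: any spaces on either side of the hyphen.
v.toy.single <- gsub(" *- *", "-", v.toy05)

print(v.toy07)
```

Keeping two separate passes fits the notebook's habit of small, individually checkable transformations.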

2.1.4.4 Question marks

All question marks should be deleted.

# Visualization of all the rows containing QUEstion marks.
v.ape.que.row <- grep("[?]", v.ape07)
v.ape.que.content <- v.ape07[v.ape.que.row]
m.ape.que <- cbind(v.ape.que.row, v.ape.que.content)
head(m.ape.que)
##      v.ape.que.row v.ape.que.content  
## [1,] "3491"        "Aguirre ?"        
## [2,] "3915"        "Alcázar y de la ?"
## [3,] "8527"        "Egaña y ?"        
## [4,] "9751"        "Izene ?"          
## [5,] "13734"       "García ?"         
## [6,] "13735"       "García ?"

All observations can be fixed by deleting the question mark, except for two cases that require a special solution.

# Correction of all the mistakes requiring a particular solution.
v.ape07[171954] = "Parra y Gismero"
v.ape07[237104] = "Sancho y Llodrá"

# Deletion of all the remaining question marks.
v.ape08 <- gsub("[?]+", "", v.ape07)

All question marks have been properly removed but there are still many cases with extra spaces to be removed.

2.1.4.5 Others

Any other kind of punctuation mark should be removed.

# Visualization of all the rows containing other kinds of PUNctuation marks.
v.ape.pun.row <- grep("(?!['.?-])[[:punct:]]", v.ape08, perl = TRUE)
v.ape.pun.content <- v.ape08[v.ape.pun.row]
m.ape.pun <- cbind(v.ape.pun.row, v.ape.pun.content)
head(m.ape.pun)
##      v.ape.pun.row v.ape.pun.content                 
## [1,] "15379"       "Palau-Ribes O`Callaghan"         
## [2,] "25295"       "Pérez Cara,és"                   
## [3,] "66331"       "Jaime (sic) Sorribes"            
## [4,] "70910"       "Jofré de Villegas, Cantalapiedra"
## [5,] "141115"      "Monja Fajardo, de la"            
## [6,] "167400"      "Bobillo D`istria"

All cases must be fixed and require a particular solution.
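
The search pattern used above relies on a negative lookahead, which is why it needs `perl = TRUE`. The sketch below, on invented values, shows how the marks already handled (apostrophe, dot, question mark and hyphen) are skipped while any other punctuation mark is reported.

```r
v.toy08 <- c("O'Farrell", "A. Rios", "Diaz-Parreno", "Monja Fajardo, de la",
             "Jaime (sic) Sorribes", "O`Callaghan")

# (?!['.?-]) rejects the allowed marks before [[:punct:]] is tried,
# so only the remaining punctuation marks produce a match.
v.toy.pun.row <- grep("(?!['.?-])[[:punct:]]", v.toy08, perl = TRUE)
print(v.toy08[v.toy.pun.row])
```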

# Correction of all the mistakes requiring a particular solution.
v.ape08[15379] = "Palau-Ribes O'Callaghan" 
v.ape08[25295] = "Pérez Caramés"                   
v.ape08[66331] = "Jaime Sorribes"            
v.ape08[70910] = "Jofré de Villegas Cantalapiedra"
v.ape08[141115] = "De la Monja Fajardo"            
v.ape08[167400] = "Bobillo d'Istria"                
v.ape08[177813] = "Pérez Fernández"    
v.ape08[199864] = "Riera Palmer"                   
v.ape08[214135] = "Ruano Castro Nuño"              
v.ape08[234077] = "Sanchez Mendez"                 
v.ape08[237498] = "Sanmartín Peleteiro"            
v.ape08[240377] = "Arriola Pedroarena"             
v.ape08[240855] = "Arroyo y Arroyo"                
v.ape08[241137] = "Artero Pérez"                  
v.ape08[265385] = "Torres Sánchez-Blanco"           
v.ape08[271065] = "Vargas Alvarez-Castellanos"      
v.ape08[278336] = "Ayuso Pérez"                    
v.ape08[365418] = "López d'Hers"                    
v.ape08[365419] = "López d'Hers"                    
v.ape08[372201] = "Coig-O'Donnel y Durán"           
v.ape08[372202] = "Coig-O'Donnell Bertrán de Lis"   
v.ape08[443419] = "Goya Busquets"                  
v.ape08[448128] = "Guivernáu Pla"                   
v.ape08[451462] = "Amigó Tuero-O'Donnell"           
v.ape08[498351] = "Manich Oliva"                   
v.ape08[503044] = "Marrero O'Shanahan"              
v.ape08[563384] = "O'Shanahan Roca" 

However, although accents are not considered punctuation marks, typists sometimes enter an isolated accent where a punctuation mark was intended.

# Visualization of all the rows containing isolated ACCents.
v.ape.acc.row <- grep("[`´^¨]", v.ape08, perl = TRUE)
v.ape.acc.content <- v.ape08[v.ape.acc.row]
m.ape.acc <- cbind(v.ape.acc.row, v.ape.acc.content)
head(m.ape.acc)
##      v.ape.acc.row v.ape.acc.content             
## [1,] "42041"       "Puente O´Connor"             
## [2,] "51845"       "Ribalta O´moore"             
## [3,] "63123"       "Rodríguez L´pez"             
## [4,] "104473"      "Solis O´Connor"              
## [5,] "106158"      "Marqués d´Oliveira y Vicandi"
## [6,] "107782"      "Beardo A´Perl"

Almost all cases should be apostrophes.

# Correction of all the mistakes requiring a particular solution.
v.ape08[63123] = "Rodríguez López"             
v.ape08[107782] = "Beardo Perl" 
v.ape08[129128] = "Vázquez Rodríguez"  
v.ape08[189531] = "Portilla Fernández"              
v.ape08[266543] = "Tristancho y Sánchez"
v.ape08[426321] = "Ramírez Galán"

# Transformation of all isolated accents to apostrophes.
v.ape09 <- gsub("´", "'", v.ape08)

All unwanted symbols have been removed or fixed, but there are still many inconsistencies regarding casing and spacing. Moreover, there are words still attached together.

2.1.5 Casing

According to Spanish orthographic rules, all words in a family name should be capitalized except for stopwords (excluding isolated articles). See Diccionario panhispánico de dudas - Mayúsculas, 4.3.

All words containing internal upper-case letters should be fixed. Whereas apostrophes are considered part of a word, hyphens are not.

# Visualization of all the rows containing words with internal UPPer-case letters.
v.ape.upp.row <- grep("([[:alpha:]])([[:upper:]])", v.ape09)
v.ape.upp.content <- v.ape09[v.ape.upp.row]
m.ape.upp <- cbind(v.ape.upp.row, v.ape.upp.content)
head(m.ape.upp)
##      v.ape.upp.row v.ape.upp.content    
## [1,] "1895"        "AbriCáceres"        
## [2,] "1922"        "Abril  y ClaRamónte"
## [3,] "2769"        "AdroerIglesias"     
## [4,] "3457"        "AguilarAlmudevar"   
## [5,] "3458"        "AguilarAlvarez"     
## [6,] "3459"        "AguilarAnao"

Most observations can be fixed by adding a space between the upper-case letters and their preceding lower-case letter, since they are just two attached names. However, some observations require a particular solution.

# Correction of all the mistakes requiring a particular solution.
v.ape09[1922] = "Abril y Claramónte" 
v.ape09[16164] = "García Sánchez"
v.ape09[16165] = "García Sánchez"
v.ape09[27425] = "Pérez Hernández"
v.ape09[46194] = "Grifols Casas"
v.ape09[50146] = "Gutierrez Llamas"
v.ape09[50147] = "Gutierrez Perez"
v.ape09[50407] = "Gutierrez Sancho" 
v.ape09[51314] = "Resano Velazquez" 
v.ape09[62346] = "Rodríguez Chamochin" 
v.ape09[66258] = "Iñigo y Leal" 
v.ape09[70340] = "Rossello y Crespi"  
v.ape09[70676] = "Rua-Figuero Rodriguez"
v.ape09[71642] = "Juliá Suardíaz" 
v.ape09[73596] = "Lahera Claramónte" 
v.ape09[73757] = "Lampérez Cajal"                       
v.ape09[73758] = "Lampérez Navarro" 
v.ape09[81260] = "Amado Rey"   
v.ape09[92241] = "Lorenzo Ferrin"
v.ape09[92638] = "López Sáinz-Rusinés"
v.ape09[94229] = "Loscertales Laguna"    
v.ape09[102412] = "Sitjas Puig"  
v.ape09[102971] = "Sorevias Tresserras"
v.ape09[105391] = "Marijuan de Domingo" 
v.ape09[112186] = "Tapias Trapero"  
v.ape09[112499] = "Teixidor Pijuan"
v.ape09[114037] = "Martín-Caro y Caro"
v.ape09[127492] = "Matabacas Borras"
v.ape09[136869] = "Merino Martín"  
v.ape09[136870] = "Merino Martín"
v.ape09[137320] = "Mestras Perpinya"                     
v.ape09[137321] = "Mestras Perpinya"                     
v.ape09[137322] = "Mestras Perpiña"  
v.ape09[138415] = "Minagorre Garcia"                     
v.ape09[146450] = "Moro Candelás" 
v.ape09[153191] = "Muñoz Catena" 
v.ape09[159765] = "Olalla López"
v.ape09[161334] = "Orejudo y Agudíez"                    
v.ape09[161335] = "Orejudo y Agudíez"  
v.ape09[165336] = "Palazuelos Cagigas"   
v.ape09[171774] = "Parodi y Suardíaz"
v.ape09[180270] = "Peribáñez Gil"                        
v.ape09[180271] = "Peribáñez Lizama"                     
v.ape09[180272] = "Peribáñez Rubio"                      
v.ape09[180273] = "Peribáñez Sánchez"                    
v.ape09[180274] = "Peribáñez y Caveda"                   
v.ape09[180275] = "Perramón Bonafont"                    
v.ape09[180276] = "Perramón Gajú"                        
v.ape09[180277] = "Perramón Galvet"                      
v.ape09[180278] = "Perramón Presas"                      
v.ape09[180279] = "Perramón Soler" 
v.ape09[180416] = "Perlinés Calvo"                       
v.ape09[180417] = "Perlinés Carretero"                   
v.ape09[180418] = "Perlinés Díaz"                        
v.ape09[180419] = "Perlinés Martín"                      
v.ape09[180420] = "Perlinés Martín"
v.ape09[180692] = "Pesudo Claramónte"                    
v.ape09[180791] = "Pérez Rupérez"  
v.ape09[181593] = "Pinés Espadas"                        
v.ape09[181594] = "Pinés Espadas"                        
v.ape09[181595] = "Pinés Espadas"                        
v.ape09[181596] = "Pinés Espadas"                        
v.ape09[181597] = "Pinés y Espadas"                      
v.ape09[181841] = "Piernagorda Castro" 
v.ape09[182138] = "Pinela Gil"  
v.ape09[183708] = "Arenas Suardíaz" 
v.ape09[192995] = "Ques Reinés" 
v.ape09[193264] = "Quilez Casulleras" 
v.ape09[198106] = "Picallos Rodríguez"                   
v.ape09[198215] = "Resinés Llorente"                     
v.ape09[198216] = "Resinés Tolosana"                     
v.ape09[198217] = "Resinés y Díez"                       
v.ape09[198218] = "Resinés y Plaza" 
v.ape09[199706] = "Rico Linares"  
v.ape09[214411] = "Rugarcía Baso"
v.ape09[215349] = "Ruigómez García"                      
v.ape09[215350] = "Ruigómez Guerra"                      
v.ape09[215351] = "Ruigómez López"                       
v.ape09[215352] = "Ruigómez y Angulo"                    
v.ape09[215353] = "Ruilópez Peracho"                     
v.ape09[215354] = "Ruipérez Díez"                        
v.ape09[215355] = "Ruipérez García"                      
v.ape09[215356] = "Ruipérez Gutierrez"                   
v.ape09[215357] = "Ruipérez Gómez"                       
v.ape09[215358] = "Ruipérez Haro"                        
v.ape09[215359] = "Ruipérez Morant"                      
v.ape09[215360] = "Ruipérez Parreño"                     
v.ape09[215361] = "Ruipérez Picazo"                      
v.ape09[215362] = "Ruipérez Puayo"                       
v.ape09[215363] = "Ruipérez Pérez"                       
v.ape09[215364] = "Ruipérez Pérez"                       
v.ape09[215365] = "Ruipérez Pérez"                       
v.ape09[215366] = "Ruipérez Rodríguez"                   
v.ape09[215367] = "Ruipérez Rubal"                       
v.ape09[215368] = "Ruipérez Ruipérez"                    
v.ape09[215369] = "Ruipérez Sánchez"                     
v.ape09[215370] = "Ruipérez Trobajo"                     
v.ape09[215371] = "Ruipérez de la Osa"                   
v.ape09[215372] = "Ruipérez y Alfaro"                    
v.ape09[215373] = "Ruipérez y Pérez"                     
v.ape09[218671] = "Blanco Miguelon" 
v.ape09[239026] = "Arruiz Baraibar"                      
v.ape09[239027] = "Arruiz Muneta"                        
v.ape09[239028] = "Arruiz Pérez" 
v.ape09[261151] = "Cañete Gómez de Aranda"  
v.ape09[270144] = "Vallejo y Pinazo" 
v.ape09[277932] = "Ayala Peribáñez" 
v.ape09[278640] = "Barceló Cuenca"   
v.ape09[283767] = "Torres de la Torre" 
v.ape09[298723] = "Castells y Nat"
v.ape09[310820] = "Galdós Sainz"  
v.ape09[328906] = "Calderon Garcia"                      
v.ape09[328907] = "Calderon Garzon" 
v.ape09[333318] = "Aguilera y Gabaldón" 
v.ape09[361528] = "Blanco Valdivieso" 
v.ape09[391489] = "Ravell Mestres" 
v.ape09[414327] = "Cámara Rupérez"                       
v.ape09[414328] = "Cámara Rupérez" 
v.ape09[419073] = "Garcia Mugica" 
v.ape09[433183] = "Giral Mangas"                         
v.ape09[438252] = "Casaramóna Masramón"  
v.ape09[442138] = "Castillejos Cifuentes" 
v.ape09[451084] = "Alzueta y Mendíazabal" 
v.ape09[469172] = "Izquierdo Díaz" 
v.ape09[490905] = "Debernardi Santos" 
v.ape09[497594] = "Díaz de Lope Díaz Lardíez" 
v.ape09[498501] = "Diego y Agudíez" 
v.ape09[502879] = "Marquez Garrido"  
v.ape09[517022] = "Olmos Jiménez"  
v.ape09[517087] = "Escriba Bordas" 
v.ape09[522662] = "Felgueroso Suardíaz" 
v.ape09[552026] = "Galán Ruilópez"                       
v.ape09[553528] = "Gallego Rupérez" 
v.ape09[560689] = "Lorós López"   

# Addition of a space between all upper-case letters and their preceding lower-case letter.
v.ape10 <- gsub("([a-z])([[:upper:]])", "\\1 \\2", v.ape09)
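
A toy run with invented values shows both the intended effect and the reason why some rows are corrected by hand first: a word with a stray internal capital, such as "ClaRamonte", would be split in two. The replacement string is written with a plain space here.

```r
v.toy09 <- c("AguilarAlmudevar", "Abad Conde", "AdroerIglesias", "ClaRamonte")

# Insert a space between a lower-case letter and the upper-case letter after it.
v.toy10 <- gsub("([a-z])([[:upper:]])", "\\1 \\2", v.toy09)
print(v.toy10)
```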

All words starting with lower-case letters should be fixed.

# Visualization of all the rows containing words starting with LOWer-case letters.
v.ape.low.row <- grep("\\<[[:lower:]]", v.ape10)
v.ape.low.content <- v.ape10[v.ape.low.row]
m.ape.low <- cbind(v.ape.low.row, v.ape.low.content)
head(m.ape.low)
##      v.ape.low.row v.ape.low.content            
## [1,] "137"         "Abad Conde y Sevilla"       
## [2,] "266"         "Abad Jaime de Aragón y Ríos"
## [3,] "472"         "Abad Sánchez de Toledo"     
## [4,] "506"         "Abad barrasús"              
## [5,] "507"         "Abad de Castro"             
## [6,] "508"         "Abad de Cela"
# Transformation of all word-beginnings to upper-case letters.
v.ape11 <- str_to_title(v.ape10)

# Transformation of all word-beginnings after an apostrophe to upper-case letters.
v.ape12 <- gsub("(')(.?)", "\\1\\U\\2", v.ape11, perl = TRUE)
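
The apostrophe step relies on the `\\U` escape, which is only available with `perl = TRUE` and upper-cases the rest of the replacement. The sketch below starts from invented values that assume, as the text implies, that the title-casing step left the letter after the apostrophe in lower case.

```r
# Assumed output of the title-casing step (hypothetical values).
v.toy11 <- c("O'dogherty Lopez", "Roca D'ocon", "Abad Conde")

# Keep the apostrophe (group 1) and upper-case the character after it (group 2).
v.toy12 <- gsub("(')(.?)", "\\1\\U\\2", v.toy11, perl = TRUE)
print(v.toy12)
```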

All words are now capitalized, but the previously mentioned stopwords still need to be lower-cased.

# Lower-casing of all the articles and prepositions we might expect (except for unaccompanied articles and apostrophed articles or contractions).
stopwords1 <- c("( de )|( du )|( d')|( del )|( di )|( do )|( da )|( dos )|( das )")
stopwords2 <- c("( de l')|( de la )|( de los )|( de las )|( della )|( dalla )|( dell')|( von )|( van )")
stopwords3 <- c("( y )|( i )|( e )")

transformation1 <- c("\\L\\1\\2\\3\\4\\5\\6\\7\\8\\9")
transformation2 <- c("\\L\\1\\2\\3\\4\\5\\6\\7\\8\\9")
transformation3 <- c("\\L\\1\\2\\3")

v.ape13 <- gsub((stopwords1), transformation1, v.ape12, ignore.case = TRUE, perl = TRUE)
v.ape14 <- gsub((stopwords2), transformation2, v.ape13, ignore.case = TRUE, perl = TRUE)
v.ape15 <- gsub((stopwords3), transformation3, v.ape14, ignore.case = TRUE, perl = TRUE)
v.ape16 <- gsub("(\\<De La )", "De la ", v.ape15)
v.ape17 <- gsub("(\\<De Los )", "De los ", v.ape16)
v.ape18 <- gsub("(\\<De Las )", "De las ", v.ape17)
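
For comparison, the same idea can be sketched more compactly with lookarounds. This is an alternative formulation under stated assumptions, not the notebook's code: `stopwords.toy` is a hypothetical, shortened stopword list and the values are invented. Because the lookarounds do not consume the surrounding spaces, a stopword at the very start of a name is left untouched, which matches the need for the separate "De la", "De los" and "De las" steps above.

```r
v.toy12 <- c("Abad De Castro", "Ruiz De La Osa", "Conde Y Sevilla", "De La O")

# Lower-case a stopword only when a space precedes and follows it.
# Longer alternatives come first so that "de la" wins over "de".
stopwords.toy <- "(?<= )(de la|de los|de las|del|de|y)(?= )"
v.toy13 <- gsub(stopwords.toy, "\\L\\1", v.toy12,
                ignore.case = TRUE, perl = TRUE)
print(v.toy13)
```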

All Spanish conjunctions “o” and Portuguese masculine singular articles “o” should be lower-cased.

# Visualization of all the rows containing " O "s. 
v.ape.ooo.row <- grep(" O ", v.ape18)
v.ape.ooo.content <- v.ape18[v.ape.ooo.row]
m.ape.ooo <- cbind(v.ape.ooo.row, v.ape.ooo.content)
head(m.ape.ooo)
##      v.ape.ooo.row v.ape.ooo.content        
## [1,] "13614"       "Heras Treceño O Terceño"
## [2,] "159296"      "De la O y Díaz"         
## [3,] "159297"      "De la O y Díaz"         
## [4,] "187413"      "De la O Fernandez"      
## [5,] "438595"      "Casanovas O Kely"       
## [6,] "438596"      "Casanovas O Kely"
# Correction of all the mistakes requiring a particular solution.
v.ape18[13614] = "Heras Treceño"
v.ape18[438595] = "Casanovas O'Kelly"
v.ape18[438596] = "Casanovas O'Kelly" 

All words are properly cased now. However, there are still many improper spacings.

2.1.6 Spacing

All extra spaces should be removed. On the one hand, two or more consecutive spaces should be collapsed into one.

# Visualization of all the rows containing 2 or more consecutive SPAces.
v.ape.spa.row <- grep("\\s{2,}", v.ape18)
v.ape.spa.content <- v.ape18[v.ape.spa.row]
m.ape.spa <- cbind(v.ape.spa.row, v.ape.spa.content)
head(m.ape.spa)
##      v.ape.spa.row v.ape.spa.content         
## [1,] "1899"        "Abril  Carrión"          
## [2,] "1900"        "Abril  Crespo"           
## [3,] "1901"        "Abril  Díez"             
## [4,] "1902"        "Abril  Escusa"           
## [5,] "1903"        "Abril  Fernández Figares"
## [6,] "1904"        "Abril  Fernández-Figares"

All cases need to be collapsed into a single space.

# Deletion of all the extra spaces.
v.ape19 <- gsub("\\s{2,}", " ", v.ape18)

On the other hand, all padding spaces should be removed.

# Visualization of all the PADding spaces.
v.ape.pad.row <- grep("(^ )|( $)", v.ape19)
v.ape.pad.content <- v.ape19[v.ape.pad.row]
m.ape.pad <- cbind(v.ape.pad.row, v.ape.pad.content)
head(m.ape.pad)
##      v.ape.pad.row v.ape.pad.content 
## [1,] "3491"        "Aguirre "        
## [2,] "3915"        "Alcázar y de la "
## [3,] "8527"        "Egaña y "        
## [4,] "9751"        "Izene "          
## [5,] "13734"       "García "         
## [6,] "13735"       "García "

All cases require fixing.

# Deletion of all the padding spaces.
v.ape20 <- str_trim(v.ape19)

All family names are now clean according to our criteria.
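
Both spacing steps can be reproduced on a toy vector with invented values. Base R's `trimws` is used here in place of `str_trim` only so that the sketch runs without extra packages.

```r
v.toy18 <- c("Abril  Carrion", "Aguirre ", " Egana y ", "Lopez")

# Collapse runs of two or more whitespace characters into a single space.
v.toy19 <- gsub("\\s{2,}", " ", v.toy18)

# Remove leading and trailing (padding) spaces.
v.toy20 <- trimws(v.toy19)
print(v.toy20)
```

Collapsing before trimming means a name padded with several spaces is handled by the same two steps.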

2.2 Nombre

As before, Spanish names should contain neither digits nor punctuation marks other than apostrophes, hyphens and dots. Extra characters and spaces, such as non-printable characters or padding spaces, should be deleted, and the affected names fixed if necessary. Names can be composed of multiple words.

# Creation of a vector containing all the strings in column "nom".
v.nom00 <- data$nom

2.2.1 Digits

All digits should be removed. However, doing so might introduce new mistakes such as broken or incomplete words.

# Visualization of all the rows containing DIGits.
v.nom.dig.row <- grep("\\d+", v.nom00)
v.nom.dig.content <- v.nom00[v.nom.dig.row]
m.nom.dig <- cbind(v.nom.dig.row, v.nom.dig.content)
head(m.nom.dig)
##      v.nom.dig.row v.nom.dig.content         
## [1,] "48243"       "14924"                   
## [2,] "48284"       "13805"                   
## [3,] "105616"      "Maria del Rosario0"      
## [4,] "106005"      "0"                       
## [5,] "120766"      "Vicente3"                
## [6,] "129857"      "Mª Concepción Francis0ca"

The deletion of all digits is a great solution for all the cases.

# Deletion of all the digits.
v.nom01 <- gsub("\\d+", "", v.nom00)

All digits have been removed. However, there are some cases containing extra symbols or spacing that still need fixing.
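
One consequence worth checking (an observation on invented values, not a step taken from this notebook) is that entries made up only of digits become empty strings after the deletion. A quick way to locate them is sketched below.

```r
v.toy00 <- c("14924", "Vicente3", "Francis0ca", "0")

# Delete all digits, as in the step above.
v.toy01 <- gsub("\\d+", "", v.toy00)

# Elements that consisted only of digits are now empty strings.
v.toy.empty.row <- which(!nzchar(v.toy01))
print(v.toy01)
print(v.toy.empty.row)
```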

2.2.2 Control characters

All control characters should be removed. As before, doing so might introduce new mistakes such as broken or incomplete words.

# Visualization of all the rows containing CONtrol characters.
v.nom.con.row <- grep("[[:cntrl:]]", v.nom01)
v.nom.con.content <- v.nom01[v.nom.con.row]
m.nom.con <- cbind(v.nom.con.row, v.nom.con.content)
head(m.nom.con)
##      v.nom.con.row v.nom.con.content              
## [1,] "2741"        "\r\r\nJulio"                  
## [2,] "3775"        "Mariano\r\r\nMariano"         
## [3,] "3988"        "Julian\r\r\nulian"            
## [4,] "4020"        "Julián\r\r\nJulián"           
## [5,] "8329"        "\r\r\nManuel"                 
## [6,] "16487"       "María Teresa\r\r\naría Teresa"

All observations can be fixed by deleting the control characters and the characters that follow them, except for some cases that require a particular solution.

# Correction of all the mistakes requiring a particular solution.
v.nom01[2741] = "Julio"  
v.nom01[8329] =  "Manuel"
v.nom01[56986] = "María Cruz"
v.nom01[78085] = "Juan Miguel"
v.nom01[102342] = "José Luis"
v.nom01[110897] = "Micaela" 
v.nom01[116956] = "María Esther"             
v.nom01[144726] = "Julián Benigno"  
v.nom01[157155] = "María del Carmen" 
v.nom01[182887] = "María Luisa" 
v.nom01[189135] = "Manuel"  
v.nom01[190350] = "Manuel" 
v.nom01[191047] = "María de las Mercedes"                     
v.nom01[208484] = "María del Carmen" 
v.nom01[218039] = "José Luis" 
v.nom01[218492] = "Miguel María" 
v.nom01[236342] = "María Carmen"   
v.nom01[331512] = "María Olga"
v.nom01[353093] = "José María" 
v.nom01[402772] = "José"
v.nom01[416385] = "Juan"  
v.nom01[419601] = "Jaime Jesus"
v.nom01[435114] = "Marcelino Alfredo"                           
v.nom01[435464] = "Juan"                      
v.nom01[441887] = "Juan" 
v.nom01[458871] = "María  del Pilar"  
v.nom01[464700] = "José Ignacio"
v.nom01[482161] = "María Luisa" 
v.nom01[482422] = "José María"
v.nom01[507616] = "Matilde" 

# Deletion of all the remaining control characters and the characters that follow them.
v.nom02 <- gsub("[[:cntrl:]]+.*", "", v.nom01)

All control characters have been properly removed, but many cases still require a space where these characters used to be.
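
A small sketch shows why some rows needed a particular solution before the generic deletion. The strings below are hypothetical stand-ins modelled on the output above: when the control characters come first, the pattern deletes the whole name.

```r
# Hypothetical sample modelled on the rows listed above.
sample.con <- c("Mariano\r\r\nMariano", "Julian\r\r\nulian", "\r\r\nJulio")

# Same pattern as in the notebook: control characters plus everything after them.
sample.fix <- gsub("[[:cntrl:]]+.*", "", sample.con)
sample.fix

# The third entry is now an empty string, so it has to be fixed by hand beforehand.
nchar(sample.fix)
```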

2.2.3 Ordinal indicators

As previously done, all ordinal indicators should be replaced by their intended written meaning or, in case of ambiguity, by a dot.

# Visualization of all the rows containing MASculine ordinal indicators.
v.nom.mas.row <- grep("º", v.nom02)
v.nom.mas.content <- v.nom02[v.nom.mas.row]
m.nom.mas <- cbind(v.nom.mas.row, v.nom.mas.content)
head(m.nom.mas)
##      v.nom.mas.row v.nom.mas.content
## [1,] "3981"        "Juan Antº"      
## [2,] "4162"        "Juan Fº"        
## [3,] "5092"        "José Aº."       
## [4,] "5507"        "Juan Antº"      
## [5,] "8248"        "Mº  Nieves"     
## [6,] "13654"       "Felipa Mº"

All cases can be fixed either by applying a particular solution or by replacing the “º” with “aría”. The latter will be done together with the feminine ordinal indicator transformations.

# Correction of all the mistakes requiring a particular solution.
v.nom02[3981] = "Juan Antonio"          
v.nom02[4162] = "Juan F."            
v.nom02[5092] = "José Antonio"           
v.nom02[5507] = "Juan Antonio" 
v.nom02[41545] = "José Antonio"    
v.nom02[49425] = "José Antonio"          
v.nom02[49705] = "Antonio Alfonso" 
v.nom02[64699] = "Angel Antonio" 
v.nom02[85687] = "José Antonio" 
v.nom02[107021] = "José Antonio"          
v.nom02[112758] = "Juan Antonio"  
v.nom02[167725] = "Juan Antonio" 
v.nom02[201854] = "Luis Antonio"          
v.nom02[210211] = "Juan Antonio José"
v.nom02[260864] = "Antonio Ramón"   
v.nom02[267143] = "Vicente Antonio"       
v.nom02[285206] = "José Antonio"          
v.nom02[337274] = "Antonio"               
v.nom02[351582] = "José Antonio" 
v.nom02[390948] = "Juan Antonio"
v.nom02[419968] = "José Antonio" 
v.nom02[420271] = "Ambrosio Eugenio" 
v.nom02[464008] = "José Antonio"          
v.nom02[493846] = "Jose Antonio"          
v.nom02[503402] = "José Antonio"            
v.nom02[521045] = "Antonio"               
v.nom02[524236] = "Juan Antonio"   

Only “Fº” has been transformed to “F.”.

# Visualization of all the rows containing FEMinine ordinal indicators.
v.nom.fem.row <- grep("ª", v.nom02)
v.nom.fem.content <- v.nom02[v.nom.fem.row]
m.nom.fem <- cbind(v.nom.fem.row, v.nom.fem.content)
head(m.nom.fem)
##      v.nom.fem.row v.nom.fem.content        
## [1,] "1"           "Mª del Rocío"           
## [2,] "2"           "Mª Argentina Concepción"
## [3,] "5"           "Mª de los Angeles"      
## [4,] "6"           "Mª de los Dolores"      
## [5,] "23"          "Mª Concepción"          
## [6,] "24"          "Mª de los Remedios"

In all cases, the “ª” can be replaced with “aría”.

# Replacement of all the "ª"s and the remaining "º"s by "aría"s.
v.nom03 <- gsub("(º|ª)", "aría", v.nom02)

All cases have been properly fixed, but there are still some cases with improper casing and/or spacing.
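
The order of the two steps matters: the generic replacement is only safe once abbreviations such as “Antº” have been expanded by hand. The following minimal sketch illustrates this with hypothetical sample strings; Unicode escapes are used only to keep the snippet independent of the file encoding.

```r
# "\u00aa" is the feminine indicator, "\u00ba" the masculine one and "\u00ed" an accented i.
sample.ord <- c("M\u00aa Teresa", "M\u00ba Nieves", "Juan Ant\u00ba")

# Same substitution as in the notebook.
sample.rep <- gsub("(\u00ba|\u00aa)", "ar\u00eda", sample.ord)

# The first two entries become proper names; the third one becomes "Juan Antaria"
# (with an accented i), which is why such rows are corrected individually first.
sample.rep
```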

2.2.4 Punctuation marks

2.2.4.1 Apostrophes

All apostrophes should be surrounded by alphabetic characters.

# Visualization of all the rows containing APOstrophes.
v.nom.apo.row <- grep("'", v.nom03)
v.nom.apo.content <- v.nom03[v.nom.apo.row]
m.nom.apo <- cbind(v.nom.apo.row, v.nom.apo.content)
head(m.nom.apo)
##      v.nom.apo.row v.nom.apo.content       
## [1,] "61575"       "María José de L'"      
## [2,] "181902"      "Mar'ia Rosa"           
## [3,] "183788"      "María de las Nieves d'"
## [4,] "468988"      "Enrique d'"

All cases are correct except one: v.nom03[181902] = "Mar'ia Rosa", that should have the apostrophe removed.

# Correction of all the mistakes requiring a particular solution.
v.nom03[181902] = "María Rosa" 

2.2.4.2 Dots

All dots should be preceded by an isolated upper-case letter and followed by a space.

# Visualization of all the rows containing DOTs between letters.
v.nom.dot.row <- grep("([[:alpha:]])(\\.)([[:alpha:]])", v.nom03)
v.nom.dot.content <- v.nom03[v.nom.dot.row]
m.nom.dot <- cbind(v.nom.dot.row, v.nom.dot.content)
head(m.nom.dot)
##      v.nom.dot.row v.nom.dot.content             
## [1,] "2261"        "Marcelino P.J."              
## [2,] "38027"       "María.Luz"                   
## [3,] "45286"       "F.Javier"                    
## [4,] "45455"       "Alejandro F.J."              
## [5,] "77090"       "María de la Paz de la S.Trin"
## [6,] "97231"       "A.Montserrat"

All cases either need a space between the dot and the following word or require a particular solution.

# Correction of all the mistakes requiring a particular solution.
v.nom03[38027] = "María Luz"  
v.nom03[77090] = "María de la Paz de la Santa Trinidad"
v.nom03[105662] = "José del Santo Sacramento"  
v.nom03[175396] = "María Inmaculada de San Diego" 
v.nom03[211023] = "Daría Sacramento" 
v.nom03[271021] = "María de la Concepción"  
v.nom03[423061] = "Salvador" 
v.nom03[426496] = "Laureano" 
v.nom03[485213] = "Antonia María"

# Addition of a space between the dot and the following word of the remaining cases.
v.nom04 <- gsub("([[:alpha:]])(\\.)([[:alpha:]])", "\\1\\2 \\3", v.nom03)

All cases are now correct, but there might still be dots after non upper-case letters.
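
One possible caveat of the pattern above is that the letter after the dot is consumed by the match, so in a run of three or more initials the second dot is skipped. The sketch below uses hypothetical strings to show this, together with a lookaround variant that only consumes the dot. The variant is an alternative suggested here, not the method used in the notebook.

```r
# Hypothetical sample strings.
sample.dot <- c("F.Javier", "Marcelino P.J.", "A.B.C")

# Pattern used in the notebook: "A.B.C" keeps its second dot attached.
sample.grp <- gsub("([[:alpha:]])(\\.)([[:alpha:]])", "\\1\\2 \\3", sample.dot)
sample.grp

# Alternative with lookarounds (perl = TRUE): every dot between letters gets a space.
sample.look <- gsub("(?<=[[:alpha:]])\\.(?=[[:alpha:]])", ". ", sample.dot, perl = TRUE)
sample.look
```

Whether this matters depends on the data: if no name contains three consecutive initials, both versions give the same result.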

# Visualization of all the rows containing Dots After a Lower-case letter.
v.nom.dal.row <- grep("([a-z])(\\.)", v.nom04)
v.nom.dal.content <- v.nom04[v.nom.dal.row]
m.nom.dal <- cbind(v.nom.dal.row, v.nom.dal.content)
head(m.nom.dal)
##      v.nom.dal.row v.nom.dal.content
## [1,] "3821"        "Juan Bta."      
## [2,] "4472"        "Juan Btaría."   
## [3,] "4569"        "Juan Bta."      
## [4,] "9795"        "José Ant."      
## [5,] "13618"       "Juan Bta."      
## [6,] "34420"       "Jose c."

All dots are correct, except for some cases where the dots are meaningless, repeated or need a space on their right.

# Correction of all the mistakes requiring a particular solution.
v.nom04[38174] = "María Luisa"                
v.nom04[38664] = "Ana María"                  
v.nom04[38839] = "María del Carmen"           
v.nom04[39203] = "Juan María" 
v.nom04[66480] = "Manuel R." 
v.nom04[108780] = "Josefa María"               
v.nom04[109492] = "José María"                 
v.nom04[110527] = "José María" 
v.nom04[133614] = "Heliodora T." 
v.nom04[202414] = "Sergio A." 
v.nom04[210211] = "Juan Antonio José"          
v.nom04[226382] = "Teresa del"                 
v.nom04[226622] = "María de los Angeles"       
v.nom04[226739] = "María America"              
v.nom04[226787] = "María de la Paz"            
v.nom04[227675] = "María Soledad"              
v.nom04[245479] = "José María"                 
v.nom04[246337] = "Eladia María"               
v.nom04[328945] = "José María"                 
v.nom04[329067] = "Gracia María"               
v.nom04[367422] = "Franco Miguel"              
v.nom04[380460] = "José C. Fco."                 
v.nom04[423832] = "Manuela y"                  
v.nom04[449652] = "Josefa"                     
v.nom04[449798] = "Pedro"                      
v.nom04[449978] = "Franco"                     
v.nom04[450323] = "Maria del Carmen"           
v.nom04[496194] = "José María"                 
v.nom04[496195] = "José María"                 
v.nom04[541750] = "Jesús María"                
v.nom04[541751] = "Jesús María"                
v.nom04[541964] = "Ana María"                  
v.nom04[542152] = "José María"                 
v.nom04[542299] = "José María"                 
v.nom04[542404] = "María Angeles"              
v.nom04[542494] = "Jose María"                 
v.nom04[542965] = "Alfonso María Ligorio"

All dots have been properly fixed.

2.2.4.3 Hyphens

All hyphens should be surrounded by alphabetic characters.

# Visualization of all the rows containing HYPhens.
v.nom.hyp.row <- grep("-", v.nom04)
v.nom.hyp.content <- v.nom04[v.nom.hyp.row]
m.nom.hyp <- cbind(v.nom.hyp.row, v.nom.hyp.content)
head(m.nom.hyp)
##      v.nom.hyp.row v.nom.hyp.content          
## [1,] "1626"        "Tarsicio-Mañé"            
## [2,] "2536"        "Angel-Raimundo"           
## [3,] "6577"        "Antonio-Félix"            
## [4,] "6733"        "María del Pilar-Felicitas"
## [5,] "6740"        "Manuel-Pedro"             
## [6,] "8690"        "Mauricio-Oko"

Most cases are correct, but some others are surrounded by at least one space or seem to be a mistake.

# Correction of all the mistakes requiring a particular solution.
v.nom04[19092] = "Servando"  
v.nom04[479660] = ""                           
v.nom04[480488] = ""   

# Deletion of all the spaces to the right of a hyphen.
v.nom05 <- gsub("(-)( +)", "\\1", v.nom04)

# Deletion of all the spaces to the left of a hyphen.
v.nom06 <- gsub("( +)(-)", "\\2", v.nom05)

All hyphens are now surrounded by alphabetic characters. However, there are still names with incorrect casing.
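
The two deletions above can also be expressed as a single substitution. The following sketch, on hypothetical sample strings, checks that both approaches give the same result; it is an equivalent formulation and not a change to the notebook's logic.

```r
# Hypothetical sample strings with spaces around the hyphen.
sample.hyp <- c("Angel - Raimundo", "Manuel- Pedro", "Antonio -Felix", "Heinz-Gerd")

# Two steps, as in the notebook: spaces to the right, then spaces to the left.
two.steps <- gsub("( +)(-)", "\\2", gsub("(-)( +)", "\\1", sample.hyp))

# One step: optional spaces on both sides of the hyphen.
one.step <- gsub(" *- *", "-", sample.hyp)

identical(two.steps, one.step)
```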

# Visualization of all the rows containing a stopword or a NON-upper-case letter after a hyphen.
v.nom.non.row <- grep("(-)([[:lower:]])", v.nom06)
v.nom.non.content <- v.nom06[v.nom.non.row]
m.nom.non <- cbind(v.nom.non.row, v.nom.non.content)
head(m.nom.non)
##      v.nom.non.row v.nom.non.content
## [1,] "53865"       "José-luis"      
## [2,] "103012"      "Mari-bel"       
## [3,] "108414"      "Al-lal"         
## [4,] "159370"      "Mar-ía Estrella"
## [5,] "166811"      "Jesú-s"         
## [6,] "166905"      "M-aría Josefa"

All but 2 cases should be fixed.

# Correction of all the cases requiring a particular solution.
v.nom06[53865] = "José Luis"
v.nom06[103012] = "Maribel" 
v.nom06[159370] = "María Estrella" 
v.nom06[166811] = "Jesús"                      
v.nom06[166905] = "María Josefa" 
v.nom06[293425] = "Antonio-Antero" 
v.nom06[463940] = "Heinz-Gerd"
v.nom06[519404] = "Adela"  

2.2.4.4 Question marks

All question marks should be deleted.

# Visualization of all the rows containing QUEstion marks.
v.nom.que.row <- grep("\\?+", v.nom06)
v.nom.que.content <- v.nom06[v.nom.que.row]
m.nom.que <- cbind(v.nom.que.row, v.nom.que.content)
head(m.nom.que)
##      v.nom.que.row v.nom.que.content
## [1,] "2066"        "?"              
## [2,] "9817"        "?"              
## [3,] "16124"       "?"              
## [4,] "16313"       "Eusebio ?"      
## [5,] "18460"       "?"              
## [6,] "21477"       "María Lioha?"

All observations can be fixed by deleting the question mark.

# Deletion of all the question marks.
v.nom07 <- gsub("\\?+", "", v.nom06)

2.2.4.5 Others

Any other kind of punctuation mark should be removed.

# Visualization of all the rows containing other kinds of PUNctuation marks.
v.nom.pun.row <- grep("(?!['.?-])[[:punct:]]", v.nom07, perl = TRUE)
v.nom.pun.content <- v.nom07[v.nom.pun.row]
m.nom.pun <- cbind(v.nom.pun.row, v.nom.pun.content)
head(m.nom.pun)
##      v.nom.pun.row v.nom.pun.content
## [1,] "13240"       "José B:"        
## [2,] "21033"       "Gerardo de la+" 
## [3,] "23577"       "Angel María,"   
## [4,] "49273"       "José  María;"   
## [5,] "52298"       "Marcelino+"     
## [6,] "52306"       "Francisco+"

All cases must be fixed and require a particular solution.

# Correction of all the mistakes requiring a particular solution.
v.nom07[13240] = "José B." 
v.nom07[21033] = "Gerardo de la"                  
v.nom07[148618] = "María Luz" 
v.nom07[158034] = "María Gloria"    
v.nom07[170095] = "María de la Paz"
v.nom07[219431] = "María M." 
v.nom07[282701] = "María Jesus"             
v.nom07[282778] = "María del Carmen"  
v.nom07[385174] = "María Del Pilar"  
v.nom07[425551] = "Amparo"         
v.nom07[426141] = "Constantina"    
v.nom07[426764] = "María"          
v.nom07[426932] = "Josefa"         
v.nom07[426948] = "Basilisa"       
v.nom07[426990] = "Carmen" 
v.nom07[457593] = "José María Rufo" 
v.nom07[467354] = "Francisco A."         
v.nom07[475260] = "Francisco de Paula" 

# Deletion of the remaining punctuation marks.
v.nom08 <- gsub("(?!['.?-])[[:punct:]]", "", v.nom07, perl = TRUE)
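
The pattern above combines a negative lookahead with the POSIX punctuation class: a character is matched only if it is a punctuation mark and is not an apostrophe, dot, question mark or hyphen. A minimal sketch on hypothetical sample strings:

```r
# Hypothetical sample strings.
sample.pun <- c("Jose B:", "Marcelino+", "Angel Maria,", "O'Donnell-Artemio", "Juan F.")

# Same pattern as in the notebook; perl = TRUE is required for the lookahead.
sample.kept <- gsub("(?!['.?-])[[:punct:]]", "", sample.pun, perl = TRUE)

# Colons, plus signs and commas are removed; apostrophes, hyphens and dots are kept.
sample.kept
```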

Although accents are not punctuation marks, they were sometimes typed as isolated characters, as if they were.

# Visualization of all the rows containing isolated ACCents.
v.nom.acc.row <- grep("[`´^¨]", v.nom08, perl = TRUE)
v.nom.acc.content <- v.nom08[v.nom.acc.row]
m.nom.acc <- cbind(v.nom.acc.row, v.nom.acc.content)
head(m.nom.acc)
##      v.nom.acc.row v.nom.acc.content       
## [1,] "49944"       "josé Mar´´ia"          
## [2,] "53394"       "Jo´se"                 
## [3,] "110651"      "´miguel"               
## [4,] "182988"      "José´de"               
## [5,] "259944"      "Teodoro d´"            
## [6,] "265957"      "María de los ´Remedios"

All cases should be fixed.

# Correction of all the mistakes requiring a particular solution.
v.nom08[49944] = "José María"          
v.nom08[53394] = "José"                 
v.nom08[110651] = "Miguel"               
v.nom08[182988] = "José de"               
v.nom08[259944] = "Teodoro d'"            
v.nom08[265957] = "María de los Remedios"
v.nom08[437959] = "O'Donell-Artemio"      
v.nom08[437960] = "O'Donnell-Artemio"
v.nom08[448052] = "Juan Antonio"  

All unwanted symbols have been removed or fixed, but there are still many inconsistencies regarding casing and spacing. Moreover, there are words still attached together.

2.2.5 Casing

According to the Spanish grammar rules, all words in a name should be capitalized except for stopwords (excluding isolated articles). See Diccionario panhispánico de dudas - Mayúsculas, 4.3.

All words containing internal upper-case letters should be fixed. Whereas apostrophes are considered part of a word, hyphens are not.

# Visualization of all the rows containing words with internal upper-case letters.
v.nom.upp.row <- grep("([[:alpha:]])([[:upper:]])", v.nom08)
v.nom.upp.content <- v.nom08[v.nom.upp.row]
m.nom.upp <- cbind(v.nom.upp.row, v.nom.upp.content)
head(m.nom.upp)
##      v.nom.upp.row v.nom.upp.content             
## [1,] "4280"        "AntoniO"                     
## [2,] "5300"        "Francisco JAVIER"            
## [3,] "7079"        "MaríaMercedes Felisa Laura J"
## [4,] "7269"        "ALicia"                      
## [5,] "7855"        "Soledad ADELAIDA de los D"   
## [6,] "7899"        "Lucia Carmen NIeves"

Most observations can be fixed by adding a space between the upper-case letters and their preceding lower-case letter, since they are just two attached names. However, some observations require a particular solution.

# Correction of all the mistakes requiring a particular solution.
v.nom08[4280] = "Antonio"                       
v.nom08[5300] = "Francisco Javier"              
v.nom08[7079] = "María Mercedes Felisa Laura J." 
v.nom08[7269] = "Alicia"
v.nom08[7855] = "Soledad Adelaida de los D." 
v.nom08[7899] = "Lucia Carmen Nieves"           
v.nom08[7987] = "Pedro José"                    
v.nom08[8385] = "Miguel Angel"                  
v.nom08[8421] = "Miguel Angel"                  
v.nom08[8480] = "Micaela"                       
v.nom08[8491] = "Miguel Angel"                  
v.nom08[8529] = "Miguel José"                   
v.nom08[8660] = "Miguel Antonio"                
v.nom08[8763] = "María Nieves Constantina"      
v.nom08[17509] = "José Manuel"                         
v.nom08[19964] = "José Javier"                    
v.nom08[21339] = "Manuel"                        
v.nom08[21459] = "Filomena"                      
v.nom08[22695] = "Jaime T."                       
v.nom08[25321] = "José Luis"                     
v.nom08[33743] = "María Luisa"                      
v.nom08[48047] = "María Josefa"                  
v.nom08[48271] = "María"                         
v.nom08[50097] = "Victor José"        
v.nom08[50396] = "María de las Nieves"               
v.nom08[56765] = "María de las C."
v.nom08[58517] = "María de los Dolores"       
v.nom08[66960] = "María Luisa"                        
v.nom08[91347] = "Angel María"                                
v.nom08[95146] = "José Antonio"       
v.nom08[95188] = "Antonio"                       
v.nom08[95196] = "José Manuel"                   
v.nom08[95213] = "Antonio"                       
v.nom08[95347] = "Carlos Abdón"                  
v.nom08[95352] = "José Manuel"                   
v.nom08[102473] = "Pedro Pío"                     
v.nom08[102876] = "Jose Luis"                     
v.nom08[108321] = "Bernardo"                      
v.nom08[112486] = "Jorge"                                       
v.nom08[124219] = "Miguel Ramon"                  
v.nom08[124602] = "José"                      
v.nom08[131802] = "Martin"                              
v.nom08[134729] = "Emilia"                        
v.nom08[137098] = "Joaquín"                                   
v.nom08[140868] = "Rosa María"               
v.nom08[144713] = "María Ignacia"                
v.nom08[147567] = "Carmen"                             
v.nom08[155074] = "José de la Esperanza"                      
v.nom08[156323] = "María del Carmen"              
v.nom08[159719] = "Lorenza de"                                  
v.nom08[160444] = "María de los Dolores"          
v.nom08[160912] = "María Angelines"               
v.nom08[162450] = "Sinésio"                       
v.nom08[162950] = "Ginésa"                        
v.nom08[163981] = "Ramón"                             
v.nom08[180925] = "María de los Dolores"          
v.nom08[180927] = "Gaspar"                        
v.nom08[180984] = "Angeles María del Carmen"      
v.nom08[181145] = "María del Pino"                
v.nom08[181169] = "María de los Angeles"          
v.nom08[181807] = "María del Carmen"              
v.nom08[183297] = "María Jesús"                     
v.nom08[185575] = "María de la Concepción"             
v.nom08[187478] = "María Teresa"       
v.nom08[190572] = "María Teresa"      
v.nom08[191382] = "Adela I."                              
v.nom08[202550] = "Arcadio"                                 
v.nom08[204963] = "Ana María"            
v.nom08[205621] = "José María"          
v.nom08[207766] = "María del Carmen"              
v.nom08[211122] = "María del Carmen"              
v.nom08[213848] = "Luisa Angelines"               
v.nom08[214140] = "Liborio Ginés"                             
v.nom08[214892] = "María de la Paz"               
v.nom08[215393] = "Maria de los Dolores"                        
v.nom08[217005] = "Juan Manuel"        
v.nom08[217068] = "José"                      
v.nom08[217099] = "Juan Antonio"       
v.nom08[217104] = "Juan Antonio"       
v.nom08[217105] = "Manuel"                  
v.nom08[217167] = "Milagros"              
v.nom08[217391] = "Miguel Angel"       
v.nom08[218011] = "José"                                       
v.nom08[223547] = "María Jesús"              
v.nom08[225180] = "Jose Luis"                                   
v.nom08[239016] = "María Julita"       
v.nom08[239195] = "Jaime"                                    
v.nom08[240956] = "María del Amparo"              
v.nom08[240961] = "María del Carmen"                          
v.nom08[246408] = "Higinia"                       
v.nom08[246578] = "Milagros"                      
v.nom08[246701] = "María del Rocío" 
v.nom08[250396] = "María del Mar"       
v.nom08[256108] = "María del Carmen"       
v.nom08[265402] = "María Luisa"        
v.nom08[267021] = "María de los Dolores"   
v.nom08[270280] = "María del Mar"                 
v.nom08[270290] = "José"                                        
v.nom08[270978] = "María Felisa"                 
v.nom08[271240] = "Manuel"                                  
v.nom08[280918] = "Juan José"            
v.nom08[281174] = "Jorge Luis"          
v.nom08[281949] = "Juan"                      
v.nom08[284251] = "Juan de la"                        
v.nom08[291959] = "Joaquín"                
v.nom08[294432] = "María Carmen"      
v.nom08[294808] = "María Angeles"    
v.nom08[296506] = "José Fernando"       
v.nom08[299432] = "Magdalena"            
v.nom08[299740] = "Jacinto"                
v.nom08[299920] = "María Luisa"        
v.nom08[302117] = "José Antonio"       
v.nom08[307504] = "José Ernesto"       
v.nom08[312112] = "Manuel"                  
v.nom08[324086] = "Jesús"                         
v.nom08[324088] = "Jesús"                    
v.nom08[330291] = "Josefina"              
v.nom08[340864] = "Jesús Serafín Mateo"           
v.nom08[340865] = "Jesús Serafín Mateo"           
v.nom08[343142] = "José"                      
v.nom08[343934] = "José Luis"            
v.nom08[344043] = "José Antonio"        
v.nom08[353093] = "José María"
v.nom08[354107] = "Manuel"  
v.nom08[377564] = "Manuel"  
v.nom08[383468] = "María Pilar"        
v.nom08[383789] = "María Victoria"  
v.nom08[383792] = "José Maria"          
v.nom08[383961] = "María Josefa"      
v.nom08[384030] = "María del Pilar"      
v.nom08[384066] = "María"                    
v.nom08[384621] = "Marcos"                  
v.nom08[389654] = "Julia del"            
v.nom08[393963] = "Jose Luis"                     
v.nom08[399485] = "María"                    
v.nom08[399690] = "Ana María"                
v.nom08[400146] = "María"                    
v.nom08[400492] = "Federico"                      
v.nom08[400586] = "José Maria"          
v.nom08[403656] = "María"                    
v.nom08[403685] = "María del Rosario" 
v.nom08[406804] = "José Luis"            
v.nom08[410602] = "María del Pilar"       
v.nom08[413860] = "Julián"                        
v.nom08[415008] = "María Milagros"   
v.nom08[416476] = "María"                    
v.nom08[417383] = "Marcelino Ezequiel"            
v.nom08[417478] = "Juan de la Cruz"               
v.nom08[417711] = "Angel Luis"                        
v.nom08[420581] = "Jose"                          
v.nom08[421629] = "Juan Bautista"       
v.nom08[423774] = "Joaquin"                
v.nom08[424443] = "Nicolás"                       
v.nom08[424456] = "Milagros"                      
v.nom08[424637] = "Miguel"                        
v.nom08[424894] = "Miguel"                        
v.nom08[425216] = "Miguel"                        
v.nom08[425692] = "Victoriano"                    
v.nom08[425780] = "María del Pilar"                        
v.nom08[431638] = "María Fe"                      
v.nom08[431879] = "María Luisa"                   
v.nom08[431882] = "María Juana"                   
v.nom08[434431] = "María Luz Leonor"              
v.nom08[437418] = "Pedro Luis Pascual"            
v.nom08[442072] = "Carmen"                        
v.nom08[442085] = "Manuel"                        
v.nom08[442168] = "María del Carmen"  
v.nom08[442590] = "María de Guadalupe"            
v.nom08[442652] = "María del Pilar"                  
v.nom08[451505] = "Miguel del"          
v.nom08[453569] = "María Concepción" 
v.nom08[455414] = "José Elias"          
v.nom08[456462] = "Ana María Teresa"              
v.nom08[456860] = "Julia Isabel"       
v.nom08[457257] = "José"                      
v.nom08[460932] = "Carlos Jacinto"                       
v.nom08[465634] = "Agustín"                       
v.nom08[472221] = "José"                      
v.nom08[480427] = "Mercedes"              
v.nom08[481119] = "Julián José"        
v.nom08[481867] = "José Gil"                      
v.nom08[482406] = "Guillermo José María"
v.nom08[482667] = "José María"          
v.nom08[484320] = "Luis María"          
v.nom08[493070] = "Manuel"                  
v.nom08[501752] = "María Araceli"    
v.nom08[507616] = "Matilde"                 
v.nom08[508022] = "María Teresa"      
v.nom08[508111] = "María de los Angeles"  
v.nom08[510664] = "Manuel"                                    
v.nom08[521957] = "María Isabel"                  
v.nom08[523962] = "Manuel"                  
v.nom08[526539] = "José María"                          
v.nom08[533909] = "Ginés Pascual"                 
v.nom08[534239] = "María Julia de los Angeles"                
v.nom08[535637] = "Jesús Manuel"       
v.nom08[537571] = "Angelines"                     
v.nom08[539789] = "María de Gracia"       
v.nom08[545790] = "Julián Vicente"          
v.nom08[545808] = "José Luis"                             
v.nom08[552883] = "Gloria Balbina"                
v.nom08[555988] = "María Luisa"   

# Correction of family names wrongly mixed with the personal name.
v.ape20[183297] = "Los Arcos y Azcona"
v.ape20[344043] = "De la Fuente Antunez"

# Addition of a space between all upper-case letters and their preceding lower-case letter.
v.nom09 <- gsub("([a-z])([[:upper:]])", "\\1 \\2", v.nom08)

All words starting with lower-case letters should be fixed.

# Visualization of all the rows containing words starting with LOWer-case letters.
v.nom.low.row <- grep("\\<[[:lower:]]", v.nom09)
v.nom.low.content <- v.nom09[v.nom.low.row]
m.nom.low <- cbind(v.nom.low.row, v.nom.low.content)
head(m.nom.low)
##      v.nom.low.row v.nom.low.content      
## [1,] "1"           "María del Rocío"      
## [2,] "5"           "María de los Angeles" 
## [3,] "6"           "María de los Dolores" 
## [4,] "24"          "María de los Remedios"
## [5,] "50"          "María del Rosario"    
## [6,] "51"          "María del Carmen"

# Transformation of all word-beginnings to upper-case letters.
v.nom10 <- str_to_title(v.nom09)

# Transformation of all word-beginnings after an apostrophe to upper-case letters.
v.nom11 <- gsub("(')(.?)", "\\1\\U\\2", v.nom10, perl = TRUE)
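
The second substitution handles the letter that follows an apostrophe, which the notebook treats as a separate step after title-casing. The sketch below applies it to hypothetical strings that are assumed to be already title-cased.

```r
# Hypothetical strings, assumed to be the result of the title-casing step.
sample.apo <- c("O'donnell-Artemio", "Teodoro D'", "Maria Rosa")

# "\\U" upper-cases the rest of the replacement, here the character after the apostrophe.
sample.cap <- gsub("(')(.?)", "\\1\\U\\2", sample.apo, perl = TRUE)

# A trailing apostrophe is left as it is, because ".?" also matches nothing.
sample.cap
```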

All words are now capitalized, but stopwords need to be lowered.

# Transformation of all stopwords to lower-case letters, using the patterns and replacements defined earlier in the notebook.
v.nom12 <- gsub(stopwords1, transformation1, v.nom11, ignore.case = TRUE, perl = TRUE)
v.nom13 <- gsub(stopwords2, transformation2, v.nom12, ignore.case = TRUE, perl = TRUE)
v.nom14 <- gsub(stopwords3, transformation3, v.nom13, ignore.case = TRUE, perl = TRUE)
v.nom15 <- gsub("(\\<De La )", "De la ", v.nom14)
v.nom16 <- gsub("(\\<De Los )", "De los ", v.nom15)
v.nom17 <- gsub("(\\<De Las )", "De las ", v.nom16)
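
The objects `stopwords1` to `stopwords3` and `transformation1` to `transformation3` are not shown in this section. Purely as an illustration of the idea, and not as a reconstruction of those objects, lowering the stopwords that are not the first word of a name could be sketched as follows. The stopword list, the name `stop.pattern` and the sample strings are all assumptions.

```r
# Hypothetical pattern: a stopword preceded by whitespace and followed by whitespace or the end.
stop.pattern <- "(?<=\\s)(De|Del|La|Las|Los|Y)(?=\\s|$)"

# Hypothetical strings, assumed to be already title-cased.
sample.cas <- c("Maria De Los Angeles", "Juan De La Cruz", "De La Cruz", "Maria Del Carmen")

# "\\L" lower-cases the matched stopword; a stopword at the very start keeps its capital.
sample.low <- gsub(stop.pattern, "\\L\\1", sample.cas, perl = TRUE)
sample.low
```

Because the lookbehind requires a preceding space, a name starting with “De La” ends up as “De la”, which matches the result of the three substitutions above.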

All Spanish conjunctions “o” and Portuguese masculine singular articles “o” should be lower-cased.

# Visualization of all the rows containing " O "s. 
v.nom.ooo.row <- grep(" O ", v.nom17)
v.nom.ooo.content <- v.nom17[v.nom.ooo.row]
m.nom.ooo <- cbind(v.nom.ooo.row, v.nom.ooo.content)
head(m.nom.ooo)
##      v.nom.ooo.row v.nom.ooo.content          
## [1,] "130045"      "José María Riard O Enriqu"
## [2,] "147583"      "María de la O Clementina" 
## [3,] "339172"      "Ana Maria de la O Y"      
## [4,] "447744"      "Alejandro O Alejandra"    
## [5,] "511105"      "María de la O Adelaida"

# Correction of all the mistakes requiring a particular solution.
v.nom17[130045] = "José María Ricardo Enrique"
v.nom17[147583] = "María de la O Clementina" 
v.nom17[339172] = "Ana Maria de la O "
v.nom17[447744] = "Alejandro"

All words are properly cased now. However, there are still many improper spacings.

2.2.6 Spacing

All extra spaces should be removed. On the one hand, 2 or more consecutive spaces should be collapsed into one.

# Visualization of all the rows containing 2 or more consecutive SPAces.
v.nom.spa.row <- grep("\\s{2,}", v.nom17)
v.nom.spa.content <- v.nom17[v.nom.spa.row]
m.nom.spa <- cbind(v.nom.spa.row, v.nom.spa.content)
head(m.nom.spa)
##      v.nom.spa.row v.nom.spa.content          
## [1,] "7499"        "Fco Juan Lázaro del  Cora"
## [2,] "7694"        "Felicidad  María de Lourd"
## [3,] "8248"        "María  Nieves"            
## [4,] "14521"       "María de la  Concepción"  
## [5,] "14738"       "Juana  María Mercedes"    
## [6,] "18616"       "María  Teresa"

All cases need to be collapsed.

# Deletion of all the extra spaces.
v.nom18 <- gsub("\\s{2,}", " ", v.nom17)

On the other hand, all padding spaces should be removed.

# Visualization of all the PADding spaces.
v.nom.pad.row <- grep("(^ )|( $)", v.nom18)
v.nom.pad.content <- v.nom18[v.nom.pad.row]
m.nom.pad <- cbind(v.nom.pad.row, v.nom.pad.content)
head(m.nom.pad)
##      v.nom.pad.row v.nom.pad.content
## [1,] "16313"       "Eusebio "       
## [2,] "86881"       "María de las "  
## [3,] "101180"      "Juan "          
## [4,] "114972"      "Alejandro "     
## [5,] "135166"      " Antonio"       
## [6,] "141676"      "Alegría "

All cases require fixing.

# Deletion of all the padding spaces.
v.nom19 <- str_trim(v.nom18)

All names are now clean according to our criteria.
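
Both spacing steps can be reproduced with base R alone, which may be handy when the tidyverse is not loaded. A minimal sketch on hypothetical sample strings, using `trimws()` as the base R counterpart of `str_trim()`:

```r
# Hypothetical sample strings with repeated and padding spaces.
sample.spa <- c("Maria  Nieves", " Antonio", "Eusebio ", "Maria de la  Concepcion ")

# Collapse runs of whitespace into one space, then remove leading and trailing whitespace.
sample.tidy <- trimws(gsub("\\s{2,}", " ", sample.spa))
sample.tidy

# Check that no repeated or padding spaces are left.
any(grepl("\\s{2,}|^\\s|\\s$", sample.tidy))
```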

2.3 Legajo

Folder numbers should only contain digits or, where the number is unknown, question marks.

# Creation of a vector containing all the strings in column "leg".
v.leg00 <- data$leg

2.3.1 Non-digits

All non-digits should be either removed or transformed to a question mark.

# Visualization of all the rows containing NOn-Digits.
v.leg.nod.row <- grep("\\D", v.leg00)
v.leg.nod.content <- v.leg00[v.leg.nod.row]
m.leg.nod <- cbind(v.leg.nod.row, v.leg.nod.content)
head(m.leg.nod)
##      v.leg.nod.row v.leg.nod.content
## [1,] "2560"        "?"              
## [2,] "2707"        "?"              
## [3,] "5643"        "?"              
## [4,] "10101"       "?"              
## [5,] "11920"       "?"              
## [6,] "12257"       "?"

All cases should have the non-digit transformed to a question mark except for v.leg00[430315] = "15041 a 43", which should have the " a 43" removed.

# Correction of all the mistakes requiring a particular solution.
v.leg00[430315] = "15041"

# Transformation of all the remaining non-digits to question marks.
v.leg01 <- gsub("\\D", "?", v.leg00)

All folder numbers are correct now. However, there might still be extra spaces.
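
The substitution replaces every single non-digit character, spaces included, with its own question mark. The sketch below, on hypothetical sample values, shows what would have happened to a value like "15041 a 43" without the particular correction.

```r
# Hypothetical sample values; the last one mimics the row corrected by hand.
sample.leg <- c("18746", "?", "12a", "15041 a 43")

# Same substitution as in the notebook.
sample.que <- gsub("\\D", "?", sample.leg)

# The last value becomes "15041???43": one question mark per non-digit character.
sample.que
```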

2.3.2 Spacing

All extra spaces should be removed. On the one hand, internal spaces should be transformed to question marks.

# Visualization of all the rows containing internal SPAces.
v.leg.spa.row <- grep("[[:space:]]+", v.leg01)
v.leg.spa.content <- v.leg01[v.leg.spa.row]
m.leg.spa <- cbind(v.leg.spa.row, v.leg.spa.content)
head(m.leg.spa)
##      v.leg.spa.row v.leg.spa.content

There are no internal spaces.

On the other hand, all padding spaces should be removed.

# Visualization of all the PADding spaces.
v.leg.pad.row <- grep("(^ )|( $)", v.leg01)
v.leg.pad.content <- v.leg01[v.leg.pad.row]
m.leg.pad <- cbind(v.leg.pad.row, v.leg.pad.content)
head(m.leg.pad)
##      v.leg.pad.row v.leg.pad.content

As before, there are no cases of padding spaces.

All folder numbers are now clean according to our criteria.

2.4 Nº de Expediente

Record numbers should only contain digits or, where the number is unknown, question marks. However, there might be cases containing the sub-string " Bis", which should be left untouched or corrected if necessary.

# Creation of a vector containing all the strings in column "num".
v.num00 <- data$num

2.4.1 Non-digits

All non-digits should be either removed or transformed to a question mark.

# Visualization of all the rows containing NOn-Digits excluding question marks and those with the sub-string "Bis" or its variations.
v.num.nod.row <- grep("^(?!.*bis|.*\\?).*\\D", v.num00, ignore.case = TRUE, perl = TRUE)
v.num.nod.content <- v.num00[v.num.nod.row]
m.num.nod <- cbind(v.num.nod.row, v.num.nod.content)
head(m.num.nod)
##      v.num.nod.row v.num.nod.content
## [1,] "2979"        "-"              
## [2,] "3115"        "-"              
## [3,] "21552"       "-"              
## [4,] "22541"       "78+"            
## [5,] "42834"       "-"              
## [6,] "48243"       "m"

All cases should be transformed to a question mark.

# Transformation of all the remaining non-digits to question marks.
v.num01 <- gsub("^(?!.*bis|.*\\?)(\\d*)(\\D+)", "\\1?", v.num00, ignore.case = TRUE, perl = TRUE)
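
The negative lookahead (?!.*bis|.*\\?) makes the pattern skip any string that contains "bis" (in any casing) or a question mark. In the remaining strings, the leading digits are kept and the run of non-digits that follows them is replaced by a single question mark. A small sketch on made-up values (v.toy is hypothetical):

```r
# Hypothetical record numbers, for illustration only.
v.toy <- c("-", "78+", "m", "123", "8bis", "47 Bis", "?")

# Strings with "bis" or "?" are skipped by the lookahead; strings made
# only of digits do not match because at least one non-digit is required.
v.toy.clean <- gsub("^(?!.*bis|.*\\?)(\\d*)(\\D+)", "\\1?", v.toy,
                    ignore.case = TRUE, perl = TRUE)
print(v.toy.clean)
```

Since the pattern is anchored with ^, at most one run of non-digits per string is replaced.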

All sub-strings “Bis” and their variations should be detached from the digits and start with an upper-case letter.

# Visualization of all the rows containing the sub-string "Bis" or its variations.
v.num.bis.row <- grep("bis", v.num01, ignore.case = TRUE)
v.num.bis.content <- v.num01[v.num.bis.row]
m.num.bis <- cbind(v.num.bis.row, v.num.bis.content)
head(m.num.bis)
##      v.num.bis.row v.num.bis.content
## [1,] "12757"       "8bis"           
## [2,] "12834"       "17bis"          
## [3,] "12835"       "17bis"          
## [4,] "13543"       "24bis"          
## [5,] "13654"       "15bis"          
## [6,] "13659"       "47 Bis"

Some cases need to be fixed. Moreover, extra non-digits should be either removed or transformed to a question mark.

# Correction of all the mistakes requiring a particular solution.
v.num01[411679] = "5 Bis" 
v.num01[496659] = "14 Bis"       
v.num01[496750] = "44 Bis"       
v.num01[496912] = "19 Bis"       
v.num01[497568] = "15 Bis"       
v.num01[498770] = "29 Bis"       
v.num01[499613] = "33 Bis"  

# Addition of a space between all sub-strings "Bis" and their preceding characters.
v.num02 <- gsub("(\\d+)( *)([[:alpha:]])", "\\1 \\3", v.num01)

# Transformation of all word-beginnings to upper-case letters.
v.num03 <- str_to_title(v.num02)
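
A short sketch of these two steps on made-up values (v.toy is hypothetical; only stringr, which is part of the tidyverse, is needed):

```r
library(stringr)

# Hypothetical record numbers, for illustration only.
v.toy <- c("8bis", "47 Bis", "17  BIS", "12", "78?")

# Exactly one space between the digits and the first letter after them.
v.toy.spaced <- gsub("(\\d+)( *)([[:alpha:]])", "\\1 \\3", v.toy)

# Upper-case initial for every word; the other letters become lower-case.
v.toy.title <- str_to_title(v.toy.spaced)
print(v.toy.title)
```

Strings without letters, such as "12" or "78?", go through both steps unchanged.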

All record numbers are correct now. However, there might still be improper spacing.

2.4.2 Spacing

All extra spaces should be removed. On the one hand, internal spaces surrounded by digits should be transformed to question marks.

# Visualization of all the rows containing internal SPaces surrounded by Digits.
v.num.spd.row <- grep("\\d[[:space:]]+\\d", v.num03)
v.num.spd.content <- v.num03[v.num.spd.row]
m.num.spd <- cbind(v.num.spd.row, v.num.spd.content)
head(m.num.spd)
##      v.num.spd.row v.num.spd.content

There are no internal spaces between digits.

On the other hand, all padding spaces should be removed.

# Visualization of all the PADding spaces.
v.num.pad.row <- grep("(^ )|( $)", v.num03)
v.num.pad.content <- v.num03[v.num.pad.row]
m.num.pad <- cbind(v.num.pad.row, v.num.pad.content)
head(m.num.pad)
##      v.num.pad.row v.num.pad.content

As before, there are no cases of padding spaces.

All record numbers are now clean according to our criteria.

2.5 Tipo de Expediente

Kinds of record cannot be fully cleansed due to the lack of homogeneous information. However, there should not be many distinct values, so strange symbols and misspellings can be easily detected by inspecting all distinct observations, and then removed or fixed.

# Creation of a vector containing all the strings in column "tip".
v.tip00 <- data$tip

# Visualization of each distinct term's frequency.
head(table(v.tip00))
## v.tip00
##                 ?         ?Personal                 4 Auxiliar Interino 
##            221967                 2                 1                 2 
## Ayudante Interino              Baja 
##                 2                 8

Most cases are corrected manually, but some misspellings can be fixed automatically through approximate matching: each string is compared against a list of intended strings and, when it is close enough to one of them, replaced by it.

# Correction of all the mistakes requiring a particular solution.
v.tip00[v.tip00 == "?Personal"] <- "Personal"
v.tip00[v.tip00 == "Personal: Veterinaria"] <- "Personal Veterinaria"
v.tip00[v.tip00 == "Depuración-rehabilitado"] <- "Depuración Rehabilitado"
v.tip00[v.tip00 == "Depuracion MaestrosDepuracion Maestros"] <- "Depuración Maestros"
v.tip00[v.tip00 == "Gubernatibo"] <- "Gubernativo"
v.tip00[v.tip00 %in% c("4", "º", "ç", "Rodríguez Villa")] <- NA

# Matching of all the misspellings of the strings gathered in corrections.
corrections <- c("Personal Docente", "Depuración Maestros", "Profesional")

match <- amatch(v.tip00, corrections, method = "jw")

# Replacement of each matched index with its intended string (unmatched cases remain NA).
match <- corrections[match]

# Transformation of all word-beginnings to upper-case letters.
v.tip01 <- str_to_title(v.tip00)

# Correction of the misspellings.
v.tip01[!is.na(match)] <- match[!is.na(match)]
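
The sketch below illustrates the approximate-matching step on made-up values (v.toy is hypothetical). It relies on the default maxDist = 0.1 of amatch(), so only strings within a distance of 0.1 from one of the intended strings count as a match, and it indexes corrections directly with the result of amatch(), a shorter equivalent of replacing the indices one by one:

```r
library(stringdist)

corrections <- c("Personal Docente", "Depuración Maestros", "Profesional")

# Hypothetical kinds of record, for illustration only.
v.toy <- c("Personal Docent", "Profesionl", "Baja")

# Index of the closest intended string, or NA when none is close enough.
idx <- amatch(v.toy, corrections, method = "jw")

# Only the matched elements are overwritten; the rest are kept as they are.
v.toy.fixed <- v.toy
v.toy.fixed[!is.na(idx)] <- corrections[idx[!is.na(idx)]]
print(v.toy.fixed)
```

Terms that are far from every intended string, such as "Baja", are left untouched.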

All kinds of record are now clean according to our criteria.

2.6 Especialidad

Due to the lack of homogeneous information, job specialization cannot be properly cleansed. For the time being, this variable will be left untouched.

# Creation of a vector containing all the strings in column "esp".
v.esp00 <- data$esp

3 New variables

Some new variables should be created before reassembling the cleansed data set. They are divided into two types:

Inferred variables:

  • GÉNERO: string with the person’s gender: “M” = male and “F” = female. Uninferred genders remain empty (NA).

Observational variables:

  • APELLIDOS EN EL ORIGINAL: string with the original first and second family name where changes have taken place, excluding: different casing and/or spacing and the deletion of control characters. Otherwise empty (NA).

  • NOMBRE EN EL ORIGINAL: string with the original personal name where changes have taken place, excluding: different casing and/or spacing, the transformation of “ª”s to “aría”s and the deletion of control characters. Otherwise empty (NA).

  • DEPURACIÓN: logical value for whether the person’s Nº DE EXPEDIENTE and/or ESPECIALIDAD indicate they should be classified as “depurado/a”.

  • TITULACIÓN: logical value for whether the person’s Nº DE EXPEDIENTE and/or ESPECIALIDAD indicate they should be classified as “titulado/a”.

  • NOTA: string with any remarkable observation (religious address, name ambiguity, etc.). Otherwise empty (NA).

3.1 Inferred variables

3.1.1 Gender

In order to infer the person’s gender a reference data base is used. This is an IHR self-created data base for internal use only and contains all names gathered by the INE and the IDESCAT.

# Reading of the data.
names <- read_csv("BdD Nombres IHR.csv")

# Visualization of the structure of the data.
str(names)
## spec_tbl_df [70,569 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Nom       : chr [1:70569] "ANTONIO" "MANUEL" "FRANCISCO" "DAVID" ...
##  $ GÈNERE IHR: chr [1:70569] "M" "M" "M" "M" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Nom = col_character(),
##   ..   `GÈNERE IHR` = col_character()
##   .. )
# Simplification of the names of the columns.
colnames(names) <- c("name", "gender")

# Simplification of the characters in column "name".
names$name <- stri_trans_general(stri_trans_tolower(names$name), id = "Latin-ASCII")

Two vectors of names, one per gender, are created. They will be compared with the cleansed names stored in object v.nom19 in order to assign each name its corresponding gender.

# Creation of two vectors of names grouped by gender: MASculine and FEMinine.
v.name.mas <- names$name[names$gender == "M"]
v.name.fem <- names$name[names$gender == "F"]

Since all names in v.name.mas and v.name.fem have been simplified, a copy of v.nom19 is also created and simplified.

# SIMplification of all clean names, which are stored in object v.nom19.
v.nom19.sim <- stri_trans_general(stri_trans_tolower(v.nom19), id = "Latin-ASCII")

# Removal of all initials (a single letter followed by a period).
v.nom19.sim <- gsub("\\<[a-z]{1}\\.", "", v.nom19.sim)

All matching indices are now stored in vectors, one for each gender.

# Storing of all matching indices in vectors grouped by Gender: MAsculine and FEminine.
v.nom.gma <- which(v.nom19.sim %in% v.name.mas)
v.nom.gfe <- which(v.nom19.sim %in% v.name.fem)

In order to store all names and their corresponding gender, a data frame is created. The gender column is initially empty.

# Creation of a data frame with all the clean names stored in object v.nom19 and a gender column, empty for now.
df.nom.gen <- data.frame("name" = v.nom19, "gender" = NA)

Using the stored matching indices, the genders are inserted into the data frame.

# Insertion of all assigned genders.
df.nom.gen$gender[v.nom.gma] <- "M"
df.nom.gen$gender[v.nom.gfe] <- "F"

Each row now has a gender or, when none was assigned, an NA value.
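
The whole lookup can be condensed into a small sketch. The reference table ref and the names in v.toy are made up for illustration and stand in for the IHR data base and v.nom19:

```r
library(stringi)

# Hypothetical reference table (already simplified) and clean names.
ref <- data.frame(name = c("antonio", "maria", "jose"),
                  gender = c("M", "F", "M"))
v.toy <- c("Antonio", "María", "Xyz")

# Lower-casing and removal of diacritics, as done for the reference names.
v.toy.sim <- stri_trans_general(stri_trans_tolower(v.toy), id = "Latin-ASCII")

# Gender assignment by exact lookup; unmatched names keep NA.
df.toy <- data.frame(name = v.toy, gender = NA)
df.toy$gender[v.toy.sim %in% ref$name[ref$gender == "M"]] <- "M"
df.toy$gender[v.toy.sim %in% ref$name[ref$gender == "F"]] <- "F"
print(df.toy)
```

With this order of assignment, a name present in both lists would end up as "F", since the feminine assignment runs last.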

3.2 Observational variables

3.2.1 Apellidos en el original

Any altered family name should be stored in this variable except for those whose transformation has been a minor casing and/or spacing change, or the deletion of control characters.

First, a copy of the original family names’ vector is created.

# Duplication of object v.ape00.
v.ape.obs00 <- v.ape00

The new vector is then cleansed to exclude the previously mentioned exceptions and its strings are converted to upper-case. A copy of the fully cleansed family names’ vector is also created and converted to upper-case, so that both vectors can be compared regardless of casing.

# Deletion of all control characters and extra spaces.
v.ape.obs01 <- gsub("[[:cntrl:]]+", "", v.ape.obs00)
v.ape.obs02 <- gsub("\\s{2,}", " ", v.ape.obs01)
v.ape.obs03 <- str_trim(v.ape.obs02)

# Upper-casing of both family names' vectors' strings.
v.ape.obs04 <- str_to_upper(v.ape.obs03)
v.ape20.upp <- str_to_upper(v.ape20)

The elements of object v.ape.obs03 (which still preserves the original casing) at the indices where v.ape.obs04 and v.ape20.upp match are all set to NA.

# Transformation of all values equal to those from object v.ape20 to NA.
v.ape.obs03[which(v.ape.obs04 == v.ape2