3  Generating the Input File for Machine Learning and Deep Learning Models

Aim: To prepare the dataset for downstream predictive modeling, we merged three key sources of information:

  • The cleaned training and test database (training_db_cleaned and test_db_cleaned from complete_metadata.RData)
  • A new dataset containing assigned MIC values for additional isolates (CAMDA25_training_con_MIC_y_fenotipo_1.csv)
  • The gene family matrix derived from the RGI analysis (2025_Training_and_testing_strict.tsv.gz)

This unified dataset serves as the foundation for training and evaluating machine learning (ML) and deep learning (DL) models aimed at predicting antimicrobial resistance patterns.

  • Author: Evelia Coss
  • Date: 1/July/2025
  • Input files:
    • complete_metadata.RData: This RData file contains a consolidated set of key objects used in workflows for antimicrobial resistance analysis and sequencing metadata. It supports smooth transitions between preprocessing, integration, and analysis stages. See complete details in Chapter 2.
    • 2025_Training_and_testing_strict.tsv.gz: Table containing results from RGI analysis using training and test datasets. This file originates from Dataset2025. It includes the following columns:
      • accession, genus, species, antibiotic - Sample metadata. Note: some species annotations may contain inconsistencies due to accession-level discrepancies.

      • phenotype, measurement_value - Prediction targets for modeling.

      • categoria_recodificada - Proposed standardized MIC (Minimum Inhibitory Concentration) value for each sample (old version, 2024).

      • columns starting with 300… - feature data. Contains counts of genes and SNPs conferring antibiotic resistance. Columns are labeled using ARO ids (see Section 3 for details). SNP features are labeled with the ARO id, followed by a dash (-) and the amino acid substitution.

      • Warning

        No RGI results are available for 42 accessions. More information in STEP 3.

  • Output files:
    • training_and_test_inputfile_cleaned.tsv.gz: Cleaned dataset (training and test) containing standardized species names, reassigned MIC values (recategorized_mic), and curated phenotypes (phenotype_assigned). This file is formatted and ready for input into machine learning (ML) and deep learning (DL) workflows. The code in this chapter writes the v2 version, training_and_test_inputfile_cleaned_v2.tsv.gz. It includes the following columns:
      • genus, species, accession, phenotype, antibiotic, measurement_value - Sample metadata with corrected species information. Source: training_db_cleaned
      • recategorized_mic - Updated (2025) standardized Minimum Inhibitory Concentration (MIC) values for each sample.
      • phenotype_assigned - Reassigned phenotype labels, limited to “Resistant” and “Susceptible” categories for training records; test records carry the placeholder 0.
      • type - Dataset classification: training or test.
      • columns starting with 300… - feature data. Contains counts of genes and SNPs conferring antibiotic resistance. Columns are labeled using ARO ids (see Section 3 for details). SNP features are labeled with the ARO id, followed by a dash (-) and the amino acid substitution.
  • Functions
    • process_mic_data(): This function processes a .tsv file containing MIC (Minimum Inhibitory Concentration) data and assigns a phenotype (Susceptible or Resistant) based on bacterial genus and recategorized MIC values. It applies quality control filters and includes defensive checks. Required Input Columns: measurement_value, accession, new_genus, phenotype. Supported Genera: Klebsiella, Escherichia, Salmonella, Streptococcus, Staphylococcus, Pseudomonas, Acinetobacter, Campylobacter, and Neisseria.

    • phenotype_assigned is NA in any of the following cases:
      • Either new_genus or the recategorized MIC is missing

      • The genus is not defined in the interpretation table

      • The MIC does not fall within any of the defined bins
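
The three NA cases listed above can be illustrated with a minimal sketch. This is not the code of process_mic_data(): the `bins` table, its cut-off values, and the helper `assign_one()` are hypothetical and exist only to show how a genus-specific lookup can return NA.

```r
# Hypothetical interpretation table: MIC bins per genus.
# The numbers are placeholders for illustration, not clinical breakpoints.
bins <- data.frame(
  new_genus = c("Klebsiella", "Klebsiella", "Neisseria", "Neisseria"),
  lower     = c(0, 16, 0, 2),
  upper     = c(16, 128, 2, 128),
  phenotype = c("Susceptible", "Resistant", "Susceptible", "Resistant"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up one isolate in the table
assign_one <- function(genus, mic) {
  # Case 1: genus or MIC is missing
  if (is.na(genus) || is.na(mic)) return(NA_character_)
  hit <- bins$phenotype[bins$new_genus == genus &
                          mic >= bins$lower & mic < bins$upper]
  # Case 2 (unknown genus) and case 3 (MIC outside all bins) give no hit
  if (length(hit) == 1) hit else NA_character_
}

isolates <- data.frame(
  new_genus = c("Klebsiella", "Neisseria", NA, "Vibrio", "Klebsiella", "Klebsiella"),
  recategorized_mic = c(32, 0.25, 4, 8, NA, 256)
)
isolates$phenotype_assigned <- mapply(
  assign_one, isolates$new_genus, isolates$recategorized_mic,
  USE.NAMES = FALSE
)
isolates
```

Only the first two toy isolates receive a label; the remaining four cover a missing genus, an unsupported genus, a missing MIC, and a MIC outside every bin.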

3.0.1 Import Data

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
Code
library(here)
here() starts at /Users/ecoss/Documents/CAMDA2025_metadatos
Code
# > training and test database
# Load metadata cleaned and antibiogram information
load(here("rawdata/TrainAndTest_cleaned", "complete_metadata.RData"))
# antibiograms_db, test_db_cleaned, training_db_cleaned, sra_metadata_db, test_completeInfo_db,  training_completeInfo_db

# > MIC values
#new_mic_db <- read_csv(file= here("rawdata/TrainAndTest_cleaned","CAMDA25_training_con_MIC_y_fenotipo_1.csv")) %>%
   # Standardizes column names: converts them to lowercase,
  # replaces spaces and special characters with underscores
 # janitor::clean_names() 
#Rows: 5420 Columns: 8
# rename columns
#names(new_mic_db)[names(new_mic_db) == "new_genus"] <- "new_species"

# > gene family matrix (strict)

# Read the gzipped .tsv file into R
rgi_results <- read_tsv(file= here("rawdata/rgi_analysis","2025_Training_and_testing_strict.tsv.gz")) %>%
   # Standardizes column names: converts them to lowercase,
  # replaces spaces and special characters with underscores
  janitor::clean_names() 
Rows: 9471 Columns: 1076
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr    (6): genus, species, accession, phenotype, antibiotic, measurement_value
dbl (1070): categoria_recodificada, ARO3000464-A121D, ARO3000464-G120K, ARO3...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# Rows: 9471 Columns: 1076

colnames(rgi_results)[colSums(is.na(rgi_results)) > 0]
[1] "measurement_value"      "categoria_recodificada"
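
Because gene and SNP features share the ARO prefix, it can help to separate them by name. The sketch below assumes the cleaned naming style produced by janitor::clean_names() (for example aro3000464_a121d) and assumes that gene-count columns carry no substitution suffix; `aro3000001` is a made-up column name used only for illustration.

```r
# Toy column names in the cleaned style (illustrative only)
cols <- c("genus", "accession", "aro3000464_a121d", "aro3000464_g120k", "aro3000001")

# All feature columns start with "aro" followed by digits
feature_cols <- grep("^aro[0-9]+", cols, value = TRUE)

# SNP features: ARO id, underscore, amino acid substitution (e.g. a121d)
snp_cols  <- grep("^aro[0-9]+_[a-z][0-9]+[a-z]$", feature_cols, value = TRUE)
gene_cols <- setdiff(feature_cols, snp_cols)

# Split each SNP label into its ARO id and substitution
snp_table <- data.frame(
  column       = snp_cols,
  aro_id       = sub("_.*$", "", snp_cols),
  substitution = toupper(sub("^[^_]+_", "", snp_cols))
)
snp_table
```

On the real table the same patterns could be applied to `colnames(rgi_results)`, provided the naming assumption holds for every feature column.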

3.1 STEP 1. Assign new MIC values and phenotypes by genus

Code
source(here("functions", "process_mic_data_function_v3.R"))
Code
training_indir <- here("rawdata/TrainAndTest_cleaned", "training_metadata_cleaned.tsv")
training_outdir <- here("rawdata/TrainAndTest_cleaned", "training_metadata_cleaned_new_mic.tsv")

# Generate the new MIC and phenotype columns and write the result to a TSV file
training_new_mic_db <- process_mic_data(training_indir, training_outdir, max_mic = 64)

head(training_new_mic_db)
# A tibble: 6 × 12
  genus     species    scientific_name_new accession genome phenotype antibiotic
  <chr>     <chr>      <chr>               <chr>     <chr>  <chr>     <chr>     
1 Neisseria gonorrhoe… Neisseria gonorrho… SRR16612… ENA_S… Suscepti… TET       
2 Neisseria gonorrhoe… Neisseria gonorrho… SRR58271… ENA_S… Suscepti… TET       
3 Neisseria gonorrhoe… Neisseria gonorrho… SRR58271… ENA_S… Suscepti… TET       
4 Neisseria gonorrhoe… Neisseria gonorrho… SRR58270… ENA_S… Suscepti… TET       
5 Neisseria gonorrhoe… Neisseria gonorrho… SRR58272… ENA_S… Suscepti… TET       
6 Neisseria gonorrhoe… Neisseria gonorrho… SRR58273… ENA_S… Suscepti… TET       
# ℹ 5 more variables: measurement_value <dbl>, ani <dbl>, new_genus <chr>,
#   recategorized_mic <fct>, phenotype_assigned <chr>
Code
dim(training_new_mic_db)
[1] 5720   12

There are 433 NA values in ‘measurement_value’.

Inspect the records with an Intermediate phenotype
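
A count of missing values like the one above can be obtained, and broken down by antibiotic, with a few lines of base R. The data frame below is a toy stand-in with made-up accessions; on the real data the same calls would be applied to training_new_mic_db.

```r
# Toy stand-in for the training table (accessions are made up)
toy <- data.frame(
  accession         = c("A1", "A2", "A3", "A4"),
  antibiotic        = c("TET", "TET", "GEN", "GEN"),
  measurement_value = c(0.25, NA, NA, 4)
)

# Total number of missing MIC measurements
n_missing <- sum(is.na(toy$measurement_value))

# Missing measurements per antibiotic
na_by_antibiotic <- tapply(is.na(toy$measurement_value), toy$antibiotic, sum)

n_missing
na_by_antibiotic
```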

Code
filter(training_new_mic_db, phenotype == "Intermediate") %>% head()
# A tibble: 6 × 12
  genus     species    scientific_name_new accession genome phenotype antibiotic
  <chr>     <chr>      <chr>               <chr>     <chr>  <chr>     <chr>     
1 Neisseria gonorrhoe… Neisseria gonorrho… SRR58270… ENA_S… Intermed… TET       
2 Neisseria gonorrhoe… Neisseria gonorrho… ERR191803 ENA_S… Intermed… TET       
3 Neisseria gonorrhoe… Neisseria gonorrho… ERR350030 ENA_S… Intermed… TET       
4 Neisseria gonorrhoe… Neisseria gonorrho… ERR223681 ENA_S… Intermed… TET       
5 Neisseria gonorrhoe… Neisseria gonorrho… ERR449502 ENA_S… Intermed… TET       
6 Neisseria gonorrhoe… Neisseria gonorrho… ERR363613 ENA_S… Intermed… TET       
# ℹ 5 more variables: measurement_value <dbl>, ani <dbl>, new_genus <chr>,
#   recategorized_mic <fct>, phenotype_assigned <chr>

3.2 STEP 2. Merge the test and training files

Join the information according to the metadata

Code
# 1. Prepare the test set: add the columns it lacks (with placeholder values) so it matches the training layout
test_inputfile <- test_db_cleaned 
test_inputfile$type <- "test"
test_inputfile$phenotype_assigned <- 0
test_inputfile$recategorized_mic <- 0
dim(test_inputfile) # [1] 4055    12
[1] 4055   12
Code
# Unify the column order
global_cols <- colnames(test_inputfile)

# 2. Unique IDs
# Join New MIC values (accession, phenotype_assigned, mic_new)
training_inputfile <- training_new_mic_db %>%
  select(-new_genus) 
training_inputfile$type <- "training"
# reorder the columns
training_inputfile <- training_inputfile %>% 
  select(any_of(global_cols))
dim(training_inputfile) # [1] 5720    12
[1] 5720   12
Code
length(unique(training_new_mic_db$accession)) 
[1] 5449
Code
# [1] 5449

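# 3. Join files
# Note: recategorized_mic is a factor in the training set, so the numeric
# placeholder 0 from the test set is not a valid level and becomes NA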
training_and_test_inputfile <- rbind(training_inputfile, test_inputfile)
Warning in `[<-.factor`(`*tmp*`, ri, value = c(0, 0, 0, 0, 0, 0, 0, 0, 0, :
invalid factor level, NA generated
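
The warning arises because recategorized_mic is a factor in the training table while the test placeholder is the number 0, which is not one of its levels. The sketch below reproduces the effect on toy data frames and shows one possible way to avoid the warning, namely using an explicit NA of the same factor type as the placeholder. This is an illustration under those assumptions, not a change to the workflow above.

```r
training_toy <- data.frame(
  accession = c("A1", "A2"),
  recategorized_mic = factor(c("0.25", "4"), levels = c("0.25", "4"))
)
test_toy <- data.frame(accession = "T1", recategorized_mic = 0)

# Reproduces the situation above: 0 is not a factor level, so it turns into NA
# (the warning is suppressed here only to keep the example quiet)
combined_warn <- suppressWarnings(rbind(training_toy, test_toy))

# Alternative placeholder: an explicit NA that shares the training levels
test_toy_na <- data.frame(
  accession = "T1",
  recategorized_mic = factor(NA, levels = levels(training_toy$recategorized_mic))
)
combined_ok <- rbind(training_toy, test_toy_na)

combined_warn
combined_ok
```

Both results hold NA for the test row; the second simply makes that outcome explicit instead of relying on a coercion warning.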
Code
# remove duplicates
training_and_test_inputfile <- training_and_test_inputfile %>% distinct()
table(training_and_test_inputfile$type)

    test training 
    4055     5555 
Code
# test training 
# 4055    5555 #9775

dim(training_and_test_inputfile)
[1] 9610   12
Code
# [1] 9610   12

length(unique(training_and_test_inputfile$accession))# 9504 = 4055+5449
[1] 9504
Code
# Join with RGI results
training_and_test_inputfile_complete <- training_and_test_inputfile %>% 
  # Join rgi results
  left_join(select(rgi_results, -genus, -species, -phenotype, 
                   -antibiotic, -measurement_value, -categoria_recodificada), by = "accession") %>% 
  mutate(antibiotic = if_else( antibiotic == "tetracycline", "TET", antibiotic))

# unique(training_and_test_inputfile$antibiotic)
# [1] "TET" "ERY" "GEN" "CAZ"

# Dimensions
dim(training_and_test_inputfile_complete) # 9610 1081
[1] 9610 1081
Code
length(unique(training_and_test_inputfile_complete$accession)) #9504
[1] 9504
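
A left join keeps the row count of the left table only when the join key is unique in the right table. The sketch below shows that check on toy data; the data frames and the `aro_feature` column are hypothetical stand-ins for the metadata and for rgi_results.

```r
library(dplyr)

metadata_toy <- data.frame(
  accession  = c("A1", "A1", "A2", "A3"),
  antibiotic = c("TET", "GEN", "TET", "TET")
)
rgi_toy <- data.frame(
  accession   = c("A1", "A2"),
  aro_feature = c(1, 0)   # hypothetical feature column
)

# The right-hand table should contain one row per accession
stopifnot(anyDuplicated(rgi_toy$accession) == 0)

joined <- left_join(metadata_toy, rgi_toy, by = "accession")

# With a unique key, the join must not add or drop rows
stopifnot(nrow(joined) == nrow(metadata_toy))

# Accessions without a match carry NA in the feature columns
unmatched <- joined %>%
  filter(is.na(aro_feature)) %>%
  pull(accession)
unmatched
```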

Column names

Code
colnames(training_and_test_inputfile)
 [1] "genus"               "species"             "scientific_name_new"
 [4] "accession"           "genome"              "phenotype"          
 [7] "antibiotic"          "measurement_value"   "ani"                
[10] "type"                "phenotype_assigned"  "recategorized_mic"  
Code
str(training_and_test_inputfile)
tibble [9,610 × 12] (S3: tbl_df/tbl/data.frame)
 $ genus              : chr [1:9610] "Neisseria" "Neisseria" "Neisseria" "Neisseria" ...
 $ species            : chr [1:9610] "gonorrhoeae" "gonorrhoeae" "gonorrhoeae" "gonorrhoeae" ...
 $ scientific_name_new: chr [1:9610] "Neisseria gonorrhoeae" "Neisseria gonorrhoeae" "Neisseria gonorrhoeae" "Neisseria gonorrhoeae" ...
 $ accession          : chr [1:9610] "SRR1661238" "SRR5827180" "SRR5827125" "SRR5827026" ...
 $ genome             : chr [1:9610] "ENA_SAMN03201584" "ENA_SAMN07351025" "ENA_SAMN07351011" "ENA_SAMN07351174" ...
 $ phenotype          : chr [1:9610] "Susceptible" "Susceptible" "Susceptible" "Susceptible" ...
 $ antibiotic         : chr [1:9610] "TET" "TET" "TET" "TET" ...
 $ measurement_value  : chr [1:9610] "0.25" "0.12" "0.12" "0.12" ...
 $ ani                : num [1:9610] 0.998 0.997 0.997 0.998 0.997 0.998 0.997 0.997 0.997 0.997 ...
 $ type               : chr [1:9610] "training" "training" "training" "training" ...
 $ phenotype_assigned : chr [1:9610] "Susceptible" "Susceptible" "Susceptible" "Susceptible" ...
 $ recategorized_mic  : Factor w/ 11 levels "0.06","0.12",..: 3 2 2 2 3 2 1 3 1 2 ...

Replace NA with zeros, except in selected columns

Code
training_and_test_inputfile_complete <- training_and_test_inputfile_complete %>%
  mutate(across(
    .cols = -c(genome, measurement_value, ani, phenotype_assigned, recategorized_mic),
    .fns  = ~ ifelse(is.na(.), 0, .)
  ))
Code
head(training_and_test_inputfile_complete) %>% 
  DT::datatable()
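
An alternative way to express the NA replacement is to name the columns that should be filled rather than the ones to skip. The sketch below assumes that every feature column starts with "aro" after clean_names(); the toy table and the column `aro3000001` are illustrative only.

```r
library(dplyr)

toy <- data.frame(
  accession        = c("A1", "A2"),
  genome           = c("G1", NA),
  aro3000464_a121d = c(1, NA),
  aro3000001       = c(NA, 2)   # made-up gene-count column
)

# Fill only the feature columns; metadata columns keep their NA values
filled <- toy %>%
  mutate(across(starts_with("aro"), ~ coalesce(.x, 0)))

filled
```

Selecting by prefix leaves character metadata untouched, so a missing genome stays NA instead of being replaced.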

Save file

Code
write_tsv(training_and_test_inputfile_complete, file = here("rawdata/TrainAndTest_cleaned", "training_and_test_inputfile_cleaned_v2.tsv.gz"))

Check which columns contain NA

Code
colnames(training_and_test_inputfile_complete)[colSums(is.na(training_and_test_inputfile_complete)) > 0]
[1] "genome"            "measurement_value" "ani"              
[4] "recategorized_mic"

Check for duplicates

Code
training_duplicadas <- training_and_test_inputfile[duplicated(training_and_test_inputfile), ]

training_duplicadas %>% 
  filter(type == "training") 
# A tibble: 0 × 12
# ℹ 12 variables: genus <chr>, species <chr>, scientific_name_new <chr>,
#   accession <chr>, genome <chr>, phenotype <chr>, antibiotic <chr>,
#   measurement_value <chr>, ani <dbl>, type <chr>, phenotype_assigned <chr>,
#   recategorized_mic <fct>
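
The table has 9610 rows but 9504 unique accessions, so some accessions occupy more than one row even though no row is an exact duplicate. A minimal sketch on made-up data shows how to list such accessions; it only illustrates the check and makes no claim about why they repeat in the real data.

```r
toy <- data.frame(
  accession         = c("A1", "A1", "A2", "A3", "A3"),
  antibiotic        = c("TET", "GEN", "TET", "ERY", "ERY"),
  measurement_value = c("0.25", "4", "1", "2", "8")
)

# Exact duplicate rows: this is what duplicated() on the whole table detects
n_exact <- sum(duplicated(toy))

# Accessions that appear in more than one row
rows_per_accession <- table(toy$accession)
repeated <- names(rows_per_accession[rows_per_accession > 1])

n_exact
repeated
```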

3.3 STEP 3. Checks on the training file

Verify that the training set has no NA values

Code
# Detect columns with NA
training_and_test_inputfile_complete %>%
  filter(type == "training") %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  select(where(~ . > 0)) %>%
  names()
[1] "genome"            "measurement_value" "ani"              
[4] "recategorized_mic"
Code
# Which values does it return
unique(training_and_test_inputfile_complete$recategorized_mic)
 [1] 0.25 0.12 0.06 4    2    16   32   64   8    0.5  1    <NA>
Levels: 0.06 0.12 0.25 0.5 1 2 4 8 16 32 64
Code
# Inspect the rows that yield NA
training_and_test_inputfile_complete %>%
  filter(type == "training") %>%
  filter(is.na(recategorized_mic)) %>%
  DT::datatable()
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
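
Instead of browsing the full table, the NA rows can be summarized by asking whether a missing recategorized_mic coincides with a missing measurement_value. The sketch below uses a toy training table with made-up values; rows where a measurement exists but no recategorized MIC was assigned are the ones that likely deserve a closer look.

```r
training_toy <- data.frame(
  accession         = c("A1", "A2", "A3", "A4"),
  measurement_value = c("0.25", NA, "256", NA),
  recategorized_mic = factor(c("0.25", NA, NA, NA))
)

# Cross-tabulate the two kinds of missingness
cross <- table(
  mic_missing         = is.na(training_toy$recategorized_mic),
  measurement_missing = is.na(training_toy$measurement_value)
)

# Rows with a measurement but without a recategorized MIC
unexplained <- training_toy$accession[
  is.na(training_toy$recategorized_mic) & !is.na(training_toy$measurement_value)
]

cross
unexplained
```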

3.4 Missing RGI data

How many accessions have, and how many lack, RGI results:

Code
# Unique accessions in each dataset
accession_completo <- training_and_test_inputfile_complete %>% 
  distinct(accession)

accession_con_rgi <- rgi_results %>% 
  distinct(accession)

# Join and classify
resumen_rgi <- accession_completo %>%
  mutate(has_rgi = accession %in% accession_con_rgi$accession) 
resumen_rgi_number <- resumen_rgi %>%
  count(has_rgi)

resumen_rgi_number
# A tibble: 2 × 2
  has_rgi     n
  <lgl>   <int>
1 FALSE      42
2 TRUE     9462
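
The same question can also be phrased as an anti-join, which returns the accessions that have no partner in the RGI table. The toy data frames below are stand-ins for the merged input file and rgi_results.

```r
library(dplyr)

inputs_toy <- data.frame(accession = c("A1", "A2", "A3", "A3"))
rgi_toy    <- data.frame(accession = c("A1", "A3"))

# Accessions present in the input file but absent from the RGI results
missing_rgi <- inputs_toy %>%
  distinct(accession) %>%
  anti_join(rgi_toy, by = "accession")

missing_rgi
```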

Which species lack this information

Code
# Get the accession IDs without RGI results
training_sin_resultados_IDs <- resumen_rgi %>% 
  filter(!has_rgi)
training_sin_resultados_IDs <- training_sin_resultados_IDs$accession

# List of accessions without information 
training_sin_resultados_db <- training_and_test_inputfile_complete %>% 
  filter(accession %in% training_sin_resultados_IDs) 

training_sin_resultados_db %>% 
  DT::datatable()

How many records per species lack this information

Code
table(training_sin_resultados_db$scientific_name_new)

 Acinetobacter baumannii     Campylobacter jejuni         Escherichia coli 
                      36                        3                        4 
   Klebsiella pneumoniae    Neisseria gonorrhoeae      Salmonella enterica 
                       1                        2                        2 
   Staphylococcus aureus Streptococcus pneumoniae 
                       7                       15 
Warning

RGI produced no results for 42 accession IDs.

3.5 Compare files

Code
# v1 of the function
TrainAndTest_input_v1_db <- read_tsv(file = here("rawdata/TrainAndTest_cleaned", "training_and_test_inputfile_cleaned.tsv.gz"))
Rows: 9610 Columns: 1081
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr   (10): genus, species, scientific_name_new, accession, genome, phenotyp...
dbl (1071): ani, recategorized_mic, aro3000464_a121d, aro3000464_g120k, aro3...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# Rows: 9610 Columns: 1081

# v2 of the function
TrainAndTest_input_v2_db <- read_tsv(file = here("rawdata/TrainAndTest_cleaned", "training_and_test_inputfile_cleaned_v2.tsv.gz"))
Rows: 9610 Columns: 1081
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr   (10): genus, species, scientific_name_new, accession, genome, phenotyp...
dbl (1071): ani, recategorized_mic, aro3000464_a121d, aro3000464_g120k, aro3...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# Rows: 9610 Columns: 1081
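
When the file is read back, readr guesses the column types, so recategorized_mic returns as a double rather than a factor. Declaring the types explicitly makes the import reproducible and silences the column-specification message. The sketch below writes a small stand-in file to a temporary path; the column set is a reduced, illustrative subset of the real file.

```r
library(readr)

# Small stand-in file so the example is self-contained
tmp <- tempfile(fileext = ".tsv.gz")
toy <- data.frame(
  accession         = c("A1", "A2"),
  measurement_value = c("0.25", "4"),
  recategorized_mic = c(0.25, 4),
  aro3000464_a121d  = c(1, 0)
)
write_tsv(toy, tmp)

# Metadata columns are declared as character; everything else defaults to double
toy_back <- read_tsv(
  tmp,
  col_types = cols(
    accession         = col_character(),
    measurement_value = col_character(),
    .default          = col_double()
  )
)

str(toy_back)
```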

Distinct values of recategorized_mic and phenotype_assigned

Code
# v1
unique(TrainAndTest_input_v1_db$recategorized_mic)
 [1]  0.25  0.12  0.06  4.00  2.00 16.00 32.00 64.00  8.00  0.50  1.00    NA
Code
unique(TrainAndTest_input_v1_db$phenotype_assigned)
[1] "Susceptible" "Resistant"   NA            "0"          
Code
# v2
unique(TrainAndTest_input_v2_db$recategorized_mic)
 [1]  0.25  0.12  0.06  4.00  2.00 16.00 32.00 64.00  8.00  0.50  1.00    NA
Code
unique(TrainAndTest_input_v2_db$phenotype_assigned)
[1] "Susceptible" "Resistant"   "0"          
Code
# zeros mark the test records
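
Beyond comparing the sets of distinct values, the two versions can be compared record by record to see which labels changed. The sketch below uses toy tables in which (accession, type) identifies a row uniquely; that is an assumption of the example, and the real files would probably need additional key columns because some accessions occur in several rows.

```r
v1_toy <- data.frame(
  accession = c("A1", "A2", "A3", "T1"),
  type      = c("training", "training", "training", "test"),
  phenotype_assigned = c("Susceptible", NA, "Resistant", "0")
)
v2_toy <- data.frame(
  accession = c("A1", "A2", "A3", "T1"),
  type      = c("training", "training", "training", "test"),
  phenotype_assigned = c("Susceptible", "Resistant", "Resistant", "0")
)

# Put both versions side by side
merged <- merge(v1_toy, v2_toy, by = c("accession", "type"),
                suffixes = c("_v1", "_v2"))

a <- merged$phenotype_assigned_v1
b <- merged$phenotype_assigned_v2

# A label differs if exactly one side is NA, or both are present and unequal
differs <- (is.na(a) != is.na(b)) | (!is.na(a) & !is.na(b) & a != b)
changed <- merged[differs, ]

changed
```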