This unified dataset serves as the foundation for training and evaluating machine learning (ML) and deep learning (DL) models aimed at predicting antimicrobial resistance patterns.
Author: Evelia Coss
Date: 1/July/2025
Input files:
complete_metadata.RData: This RData file contains a consolidated set of key objects used in workflows for antimicrobial resistance analysis and sequencing metadata. It supports smooth transitions between preprocessing, integration, and analysis stages. See complete details in Chapter 2.
accession, genus, species, antibiotic - Sample metadata. Note: some species annotations may contain inconsistencies due to accession-level discrepancies.
phenotype, measurement_value - Prediction targets for modeling.
categoria_recodificada - Proposed standardized MIC (Minimum Inhibitory Concentration) value for each sample (old version, 2024).
columns starting with 300… - feature data. Contains counts of genes and SNPs conferring antibiotic resistance. Columns are labeled using ARO ids (see Section 3 for details). SNP features are labeled with the ARO id, followed by a dash (-) and the amino acid substitution.
Warning
No results are available for 42 accessions. See STEP 3 for more information.
Output files:
training_and_test_inputfile_cleaned.tsv.gz: Cleaned dataset (training and test) containing standardized species names, reassigned MIC values (recategorized_mic), and curated phenotypes (assign_phenotype). This file is formatted and ready for input into machine learning (ML) and deep learning (DL) workflows. It includes the following columns:
genus, species, accession, phenotype, antibiotic, measurement_value - Sample metadata with corrected species information. Source: training_db_cleaned
recategorized_mic - Updated (2025) standardized Minimum Inhibitory Concentration (MIC) values for each sample.
assign_phenotype - Reassigned phenotype labels, limited to “Resistant” and “Susceptible” categories.
type - Indicates whether each sample belongs to the training or the test dataset.
columns starting with 300… - feature data. Contains counts of genes and SNPs conferring antibiotic resistance. Columns are labeled using ARO ids (see Section 3 for details). SNP features are labeled with the ARO id, followed by a dash (-) and the amino acid substitution.
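The labeling convention for feature columns described above can be illustrated with a short sketch that splits each label into its ARO id and, for SNP features, the amino acid substitution. This is only an illustration under stated assumptions: the two SNP labels are taken from the column specification printed when the RGI matrix is read, while `ARO3000000` is a hypothetical gene-level label added here for contrast. The labels are shown in their raw form; after `janitor::clean_names()` they become lowercase with underscores.

```r
library(dplyr)
library(stringr)

# Raw feature labels. The first two appear in the RGI matrix header;
# "ARO3000000" is a HYPOTHETICAL gene-level label used only for illustration.
feature_names <- c("ARO3000464-A121D", "ARO3000464-G120K", "ARO3000000")

feature_index <- tibble::tibble(feature = feature_names) %>%
  mutate(
    # ARO id: everything up to (but not including) the dash
    aro_id = str_extract(feature, "^ARO[0-9]+"),
    # Amino acid substitution: the text after the dash, NA for gene-level features
    substitution = str_extract(feature, "(?<=-).+$"),
    # SNP features carry a substitution; gene features do not
    feature_type = if_else(is.na(substitution), "gene", "snp")
  )

feature_index
```

An index like this makes it easy to select gene-level or SNP-level features separately before modeling.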
Functions
process_mic_data(): This function processes a .tsv file containing MIC (Minimum Inhibitory Concentration) data and assigns a phenotype (Susceptible or Resistant) based on bacterial genus and recategorized MIC values. It applies quality control filters and includes defensive checks.
Required input columns: measurement_value, accession, new_genus, phenotype.
Supported genera: Klebsiella, Escherichia, Salmonella, Streptococcus, Staphylococcus, Pseudomonas, Acinetobacter, Campylobacter, and Neisseria.
phenotype_assigned is NA in any of the following cases:
If either new_genus or the recategorized MIC is missing.
If the genus is not defined in the interpretation table.
If the MIC doesn’t fall within any of the defined bins.
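The NA cases listed above can be illustrated with a minimal sketch of bin-based phenotype assignment. This is not the implementation of process_mic_data(): the interpretation table, its cut-offs, the genus "Bacillus", and the sample accessions are all hypothetical placeholders chosen only so that each NA case appears once.

```r
library(dplyr)

# HYPOTHETICAL interpretation table: cut-offs are placeholders,
# not the breakpoints used by process_mic_data().
interpretation_tbl <- tibble::tribble(
  ~new_genus,    ~mic_min, ~mic_max, ~phenotype_assigned,
  "Escherichia", 0,        1,        "Susceptible",
  "Escherichia", 4,        Inf,      "Resistant"
)

# HYPOTHETICAL samples, one per situation described above
samples <- tibble::tribble(
  ~accession, ~new_genus,    ~recategorized_mic,
  "S1",       "Escherichia", 0.5,       # falls in the Susceptible bin
  "S2",       "Escherichia", 8,         # falls in the Resistant bin
  "S3",       NA_character_, 8,         # missing genus
  "S4",       "Escherichia", NA_real_,  # missing MIC
  "S5",       "Bacillus",    8,         # genus absent from the table
  "S6",       "Escherichia", 2          # MIC between the defined bins
)

assigned <- samples %>%
  # Pair every sample with every bin defined for its genus
  left_join(interpretation_tbl, by = "new_genus",
            relationship = "many-to-many") %>%
  # A bin matches only when both MIC and bin limits are present and the MIC lies inside
  mutate(in_bin = !is.na(recategorized_mic) & !is.na(mic_min) &
           recategorized_mic >= mic_min & recategorized_mic <= mic_max) %>%
  group_by(accession) %>%
  # Keep the label of the matching bin, or NA when no bin matches
  summarise(
    phenotype_assigned = if (any(in_bin)) phenotype_assigned[in_bin][1] else NA_character_,
    .groups = "drop"
  )

assigned
```

In this toy example only S1 and S2 receive a label; S3 to S6 each reproduce one of the NA situations.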
3.0.1 Import Data
Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Code
library(here)
here() starts at /Users/ecoss/Documents/CAMDA2025_metadatos
Code
# > training and test database
# Load cleaned metadata and antibiogram information
load(here("rawdata/TrainAndTest_cleaned", "complete_metadata.RData"))
# antibiograms_db, test_db_cleaned, training_db_cleaned, sra_metadata_db, test_completeInfo_db, training_completeInfo_db

# > MIC values
#new_mic_db <- read_csv(file= here("rawdata/TrainAndTest_cleaned","CAMDA25_training_con_MIC_y_fenotipo_1.csv")) %>%
# Standardizes column names: converts them to lowercase,
# replaces spaces and special characters with underscores
# janitor::clean_names() #Rows: 5420 Columns: 8

# rename columns
#names(new_mic_db)[names(new_mic_db) == "new_genus"] <- "new_species"

# > gene family matrix (strict)
# Read the compressed .tsv file into R
rgi_results <- read_tsv(file = here("rawdata/rgi_analysis", "2025_Training_and_testing_strict.tsv.gz")) %>%
  # Standardizes column names: converts them to lowercase,
  # replaces spaces and special characters with underscores
  janitor::clean_names()
Rows: 9471 Columns: 1076
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): genus, species, accession, phenotype, antibiotic, measurement_value
dbl (1070): categoria_recodificada, ARO3000464-A121D, ARO3000464-G120K, ARO3...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
training_indir <- here("rawdata/TrainAndTest_cleaned", "training_metadata_cleaned.tsv")
training_outdir <- here("rawdata/TrainAndTest_cleaned", "training_metadata_cleaned_new_mic.tsv")
# Generate a new MIC and phenotype per species, and save the result as a TSV file
training_new_mic_db <- process_mic_data(training_indir, training_outdir, max_mic = 64)
head(training_new_mic_db)

# Check the training rows whose recategorized MIC is NA
training_and_test_inputfile_complete %>%
  filter(type == "training") %>%
  filter(is.na(recategorized_mic)) %>%
  DT::datatable()
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
3.4 Missing RGI data
How many accessions do and do not have RGI results:
Code
# Unique accessions in each dataset
accession_completo <- training_and_test_inputfile_complete %>%
  distinct(accession)
accession_con_rgi <- rgi_results %>%
  distinct(accession)

# Flag each accession according to whether it has RGI results
resumen_rgi <- accession_completo %>%
  mutate(has_rgi = accession %in% accession_con_rgi$accession)

resumen_rgi_number <- resumen_rgi %>%
  count(has_rgi)
resumen_rgi_number
# A tibble: 2 × 2
has_rgi n
<lgl> <int>
1 FALSE 42
2 TRUE 9462
Which species lack this information?
Code
# Get the SRA accessions without RGI results
training_sin_resultados_IDs <- resumen_rgi %>%
  filter(!has_rgi)
training_sin_resultados_IDs <- training_sin_resultados_IDs$accession

# List of accessions without RGI information
training_sin_resultados_db <- training_and_test_inputfile_complete %>%
  filter(accession %in% training_sin_resultados_IDs)

training_sin_resultados_db %>%
  DT::datatable()
RGI produced no results for 42 accession IDs.
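As a complementary sketch, the same check can be written as an anti-join, which returns the metadata rows with no RGI counterpart in a single step and can then be summarized per species. The two tibbles below are toy stand-ins for training_and_test_inputfile_complete and rgi_results; their accession IDs and species values are made up for illustration.

```r
library(dplyr)

# Toy stand-ins (HYPOTHETICAL accessions and species)
all_samples <- tibble::tibble(
  accession = c("SRR0001", "SRR0002", "SRR0003"),
  species   = c("coli", "coli", "aureus")
)
rgi_toy <- tibble::tibble(accession = c("SRR0001", "SRR0003"))

# Rows of the metadata whose accession never appears in the RGI results
missing_rgi <- all_samples %>%
  anti_join(rgi_toy, by = "accession")

# Number of missing accessions per species
missing_by_species <- missing_rgi %>%
  distinct(accession, species) %>%
  count(species, name = "n_missing")

missing_by_species
```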
3.5 Compare files
Code
# v1 of the function
TrainAndTest_input_v1_db <- read_tsv(file = here("rawdata/TrainAndTest_cleaned", "training_and_test_inputfile_cleaned.tsv.gz"))
Rows: 9610 Columns: 1081
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (10): genus, species, scientific_name_new, accession, genome, phenotyp...
dbl (1071): ani, recategorized_mic, aro3000464_a121d, aro3000464_g120k, aro3...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# Rows: 9610 Columns: 1081
# v2 of the function
TrainAndTest_input_v2_db <- read_tsv(file = here("rawdata/TrainAndTest_cleaned", "training_and_test_inputfile_cleaned_v2.tsv.gz"))
Rows: 9610 Columns: 1081
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (10): genus, species, scientific_name_new, accession, genome, phenotyp...
dbl (1071): ani, recategorized_mic, aro3000464_a121d, aro3000464_g120k, aro3...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
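Both versions load with the same dimensions (9610 rows, 1081 columns), so any difference has to be at the level of individual values. One possible way to compare two versions is sketched below on toy tibbles: first check that the column sets agree, then use an anti-join on all columns to list the rows of v1 that have no identical row in v2. The toy values are made up, and this is an assumption about how such a comparison could be done, not the comparison used by the author.

```r
library(dplyr)

# Toy stand-ins for the v1 and v2 tables (HYPOTHETICAL values)
v1 <- tibble::tibble(
  accession         = c("A", "B", "C"),
  recategorized_mic = c(1, 4, 64),
  assign_phenotype  = c("Susceptible", "Resistant", "Resistant")
)
v2 <- tibble::tibble(
  accession         = c("A", "B", "C"),
  recategorized_mic = c(1, 8, 64),
  assign_phenotype  = c("Susceptible", "Resistant", "Resistant")
)

# 1. Columns present in one version but not in the other
cols_only_v1 <- setdiff(names(v1), names(v2))
cols_only_v2 <- setdiff(names(v2), names(v1))

# 2. Rows of v1 without an identical row in v2 (compared on every column)
changed_rows <- anti_join(v1, v2, by = names(v1))

changed_rows
```

With the real tables, the same pattern would apply to TrainAndTest_input_v1_db and TrainAndTest_input_v2_db, possibly restricted to the key columns (accession, recategorized_mic, assign_phenotype) to keep the join light.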