2  Comparing CAMDA metadata related with public data

Aim: Find how many species are correctly annotated in both databases CAMDA and NCBI related with CAMDA information (training and test datasets).

  • Author: Evelia Coss
  • Date: 1/July/2025
  • Input files:
    • test set: CAMDA test set (test_metadata_db).
    • training set: CAMDA training set (training_metadata_db).
    • NCBI data: Metadata from NCBI (June 10, 2025) (sra_metadata_db)
    • Antibiograms: Metadata from NCBI, ENA and BV-BRC. (February 16, 2025) (antibiograms_db)
  • Output files:
    • testing_metadata_cleaned.tsv : File containing only the SRA data not shared with the training set. Located in rawdata/TrainAndTest_cleaned.
    • training_metadata_cleaned.tsv : Clean dataset with no missing data and no misclassified species. Species were also reassigned when necessary. See full explanation in STEP 5.
    • Training_allstatus_sp_reference_metadata.tsv.gz : Metadata table describing the content of public datasets (sra_metadata_db and antibiograms_db) and the training dataset. Includes two additional columns: status and status_reference. See complete details in STEP 4.
    • Test_allstatus_sp_reference_metadata.tsv.gz : Metadata table describing the content of public datasets (sra_metadata_db and antibiograms_db) and the testing dataset. Includes two additional columns: status and status_reference. See complete details in STEP 4.
    • complete_metadata.RData: This RData file contains a consolidated set of key objects used in workflows for antimicrobial resistance analysis and sequencing metadata. It supports smooth transitions between preprocessing, integration, and analysis stages. See complete details in STEP 5.
      • antibiograms_db: Antimicrobial susceptibility data for individual isolates, including phenotypic profiles and MIC values.
      • test_db_cleaned: Cleaned and preprocessed test dataset, prepared for downstream validation or analysis.
      • training_db_cleaned: Curated training dataset used for model development or exploratory analysis.
      • sra_metadata_db: Detailed metadata from the Sequence Read Archive (SRA), including project identifiers, sample information, and sequencing attributes.
      • test_completeInfo_db: Enriched version of the test dataset, integrating additional metadata such as phenotype, species, sample origin, and resource status information. Dataset related to Test_allstatus_sp_reference_metadata.tsv.gz.
      • training_cleanedInfo_db: Harmonized training dataset with integrated metadata, including phenotype, species, and resource status details, ready for modeling or comparative analysis. Dataset related to Training_allstatus_sp_reference_metadata.tsv.gz.
    • training_diff_phenotype_db.tsv: : Data on accessions that have more than one phenotype associated with the MIC value. See complete details in Section 3: Diverse phenotypes.
    • training_diff_phenotype_summary.tsv: Summary of the number of phenotypes assigned per species and accession. See complete details in Section 3: Diverse phenotypes.
  • Function:
    • compare_sp(): Performs a comparison and integration of species-level metadata across different sources (e.g., training_metadata_db, sra_metadata_db, and antibiograms_db). It ensures consistency and alignment of sample annotations by merging and validating key fields such as species identity, phenotypic labels, and associated metadata. The output is a harmonized tibble suitable for downstream analyses or quality control. See complete details in STEP 4.
    • compare_genus(): Check species name matches between CAMDA and public data across genus.

2.1 Load packages

Code
library(tidyverse)
library(DT)        # Tablas bonitas
library(here)

2.2 STEP 1. Import Metadata and change format

Code
# > training dataset
training_metadata <- read_csv(here("rawdata/TrainAndTest_dataset",  "training_dataset.csv"))
Rows: 6144 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): genus, species, accession, phenotype, antibiotic, measurement_sign...
dbl  (1): measurement_value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# rename species
training_metadata_db <- training_metadata %>%
  mutate(scientific_name_CAMDA = paste(genus, species, sep = " ")) 
# Rows: 6144 Columns: 16── Column

#NOTA: This file contain NA in measurement_value 

# > test dataset
test_metadata <- read_csv(here("rawdata/TrainAndTest_dataset",  "testing_template.csv"))
Rows: 5345 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): genus, species, accession, phenotype, antibiotic, measurement_value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# rename species
test_metadata_db <- test_metadata %>%
  mutate(scientific_name_CAMDA = paste(genus, species, sep = " ")) %>% 
  mutate(antibiotic = if_else(antibiotic == "tetracycline", "TET", antibiotic))
# Rows: 5345 Columns: 6── Column

unique(test_metadata_db$antibiotic)
[1] "GEN" "TET" "ERY" "CAZ"
Code
# [1] "GEN"          "TET"          "ERY"          "CAZ"          "tetracycline"
unique(test_metadata_db$antibiotic)
[1] "GEN" "TET" "ERY" "CAZ"
Code
# [1] "GEN" "TET" "ERY" "CAZ"

# > Load public metadata from NCBI
sra_metadata <- read_csv(here("metadata", "sra-metadata.csv")) 
Rows: 11510 Columns: 47
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (26): Run, AssemblyName, download_path, Experiment, LibraryName, Librar...
dbl  (10): spots, bases, spots_with_mates, avgLength, size_MB, InsertSize, I...
lgl   (9): g1k_pop_code, source, g1k_analysis_group, Sex, Disease, Affection...
dttm  (2): ReleaseDate, LoadDate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
sra_metadata_db <- sra_metadata %>% janitor::clean_names() %>%  # para limpiar y estandarizar los nombres de columnas de un data frame
  select(run, library_strategy, library_selection, library_source, platform, model, bio_project, bio_sample, scientific_name) %>%
  # renombrar columnas
  rename(accession = run, scientific_name_complete = scientific_name)
# Rows: 11510 Columns: 47── Column specification

# Reduce name
# Quedarme con las primeras dos palabras
sra_metadata_db$scientific_name_NCBI <- word(sra_metadata_db$scientific_name_complete, 1, 2)

# > Antibiogram
# Combined antibiogram dataset from NCBI, ENA, and BV-BRC
# from: https://zenodo.org/records/14876710

# Define URL and local destination
anti_url <- "https://zenodo.org/record/15809334/files/antibiograms.tsv.zip"
destfile <- here("metadata", "antibiograms_v2.tsv.zip")
# Download the compressed file
download.file(url = anti_url, destfile = destfile, mode = "wb")
# Descomprimir el .zip
unzip(destfile, exdir = "metadata")
# Comprimir en zip
original_file <- here("metadata", "antibiograms.tsv")
gz_file <- here("metadata","antibiograms_v2.tsv.gz")
R.utils::gzip(original_file, destname = gz_file, overwrite = TRUE)

# Read the .tsv.gz file into R
# Date: February 16, 2025
# antibiograms_metadata
antibiograms_db <- readr::read_tsv(destfile) %>%
  # Standardizes column names: converts them to lowercase,
  # replaces spaces and special characters with underscores
  janitor::clean_names() %>%
  # Renames the columns
  rename(
    accession = reads,
    scientific_name_Antibiogram = species
  )
Rows: 1404850 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (14): biosample, sra_biosample, species, antibiotic, phenotype, measurem...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# > Train dataset with GTDB
# Define URL and local destination
gtdb_train_url <- "https://raw.githubusercontent.com/ccm-bioinfo/CAMDA2025_AntibioticResistance/main/preprocessing/metadata/train.ani.csv"

destfile <- here("rawdata/gtdb_results", "train.ani.csv")

# Download the compressed file
download.file(url = gtdb_train_url, destfile = destfile, mode = "wb")

# Read the .csv file into R
train_gtdb_result <- readr::read_csv(destfile)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 5752 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): old_species, new_species, accession, phenotype, antibiotic, measur...
dbl  (3): ani, measurement_value, publication

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# [1] 5752   19
# Falta eliminar los generos diferentes

# Sin accession repetidos
# all(duplicated(train_gtdb_result$accession))

# > Test dataset with GTDB
# Define URL and local destination
gtdb_test_url <- "https://raw.githubusercontent.com/ccm-bioinfo/CAMDA2025_AntibioticResistance/main/preprocessing/metadata/test.ani.csv"

destfile <- here("rawdata/gtdb_results", "test.ani.csv")

# Download the compressed file
download.file(url = gtdb_test_url, destfile = destfile, mode = "wb")

# Read the .csv file into R
test_gtdb_result <- readr::read_csv(destfile)
Rows: 5345 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): old_species, new_species, accession, phenotype, antibiotic, measure...
dbl (1): ani

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# [1] 5345    9

Distribution of SRA IDs

Code
# Count unique accessions per species in the training metadata
train_counts <- training_metadata_db %>%
  group_by(scientific_name_CAMDA) %>%
  summarise(train_num_accessions = n_distinct(accession))

# Count unique accessions per species in the testing metadata
test_counts <- test_metadata_db %>%
  group_by(scientific_name_CAMDA) %>%
  summarise(test_num_accessions = n_distinct(accession))

# Combine the counts by species, keeping all species present in the training set
combined_counts <- train_counts %>%
  left_join(test_counts, by = "scientific_name_CAMDA") %>%
  # Replace NA in testing counts with 0 for species not present in the testing data
  mutate(test_num_accessions = ifelse(is.na(test_num_accessions), 0, test_num_accessions))

# Display the combined counts table
combined_counts
# A tibble: 9 × 3
  scientific_name_CAMDA    train_num_accessions test_num_accessions
  <chr>                                   <int>               <int>
1 Acinetobacter baumannii                   488                 597
2 Campylobacter jejuni                      537                 410
3 Escherichia coli                          512                 510
4 Klebsiella pneumoniae                     791                 700
5 Neisseria gonorrhoeae                     741                 679
6 Pseudomonas aeruginosa                    571                 548
7 Salmonella enterica                       640                 700
8 Staphylococcus aureus                     571                 597
9 Streptococcus pneumoniae                  607                 604
Important

These tables may contain duplicates and incorrect annotations in the species names.

2.2.1 Functions

2.2.1.1 Compare with public data

Code
source("functions/compare_sp_function.R")

The resulting dataset includes two status columns: status and status_reference:

  • status indicates how the species annotation from the CAMDA dataset compares to external sources:

    • Matched_NCBI_Antibiogram: CAMDA matches both NCBI and antibiogram data.

    • Matched_NCBI: CAMDA matches only NCBI data.

    • Matched_Antibiogram: CAMDA matches only the antibiogram data.

    • Missing_All: No sufficient external data available for comparison.

  • status_reference evaluates the consistency between the two external sources (NCBI and antibiogram):

    • Good_sources: NCBI and antibiogram annotations agree with each other.
    • Verify_sources: NCBI and antibiogram annotations disagree and require further manual verification to ensure data reliability.

When the species annotations provided by NCBI and the antibiogram metadata do not match, it raises concerns about the accuracy of the sample’s taxonomic identity. Such discrepancies may indicate mislabeling, contamination, or outdated records. Therefore, it is essential to verify the original sources to ensure the reliability of downstream analyses.

2.2.1.2 Check species name matches between CAMDA and NCBI

Code
source(here("functions", "compare_genus_function.R"))

Find how many species have matching names between CAMDA and NCBI to identify correctly annotated entries.

2.3 Check GTDB dataset

Join information

Code
gtdb_results <- rbind(
  # train
  train_gtdb_result %>% 
  select(old_species, new_species, accession, genome) ,
  # test
  test_gtdb_result %>% 
  select(old_species, new_species, accession, genome) 
)

dim(gtdb_results) # 5752+5345
[1] 11097     4

2.4 STEP 2. Cleaning metadata information

The information for the SRA IDs in the test and training datasets was verified:

  1. SRA IDs found in both test and training sets were removed from the test set, keeping them only in the training set.
  2. Remove the SRA entries that belong to different genus.

2.4.1 Exclude overlapping SRA IDs from the test set, retaining them only in the training set

We detected that 1,290 SRA IDs are shared between both files.

Code
testing_metadata_cleaned <- test_metadata_db %>% 
  # Remove the test IDs that are shared with the training set
  filter(!(accession %in% training_metadata_db$accession)) %>% # 1,290
  # Unir con la informacion de gtdb
  left_join(select(test_gtdb_result,
                   new_species, accession, genome, ani), by ="accession") %>%
  # redondear ANI
  mutate(ani = round(ani, 3))

#  4055 SRA IDs accessions by row
nrow(testing_metadata_cleaned) #4055
[1] 4055
Code
# Numero de accesiones
length(unique(testing_metadata_cleaned$accession))#4055
[1] 4055
Code
# Sin accession duplicados
all(duplicated(testing_metadata_cleaned$accession))
[1] FALSE
Code
# Una sola accesion por fila

# Sin ani con NA
all(is.na(testing_metadata_cleaned$ani))
[1] FALSE

Save file:

Code
# Create new folder
#dir.create("../rawdata/TrainAndTest_cleaned")
write_tsv(testing_metadata_cleaned, file = here("rawdata/TrainAndTest_cleaned", "testing_metadata_cleaned.tsv"), quote = "none")

Check information

Code
length(unique(testing_metadata_cleaned$accession))
[1] 4055
Code
#[1] 4055
length(unique(test_gtdb_result$accession)) #  1,290 SRA IDs are shared between both files.
[1] 5345
Code
#[1] 5345

# > Informacion de GTDB dataset
# Filtrar solo los test que no aparecen en training
test_gtdb_cleaned <- test_gtdb_result %>% 
  filter(accession %in% testing_metadata_cleaned$accession) %>% 
  # Redondear ANI
  mutate(ani = round(ani, 3))
dim(test_gtdb_cleaned)
[1] 4055    9
Code
# [1] 4055    9

2.4.1.1 Check species name matches between CAMDA and public data

Verificar especies y eliminar datos incorrectos

Code
# POR MODIFICAR
# Ya la debo de cambiar porque GTDB es mejor en los resultados que contiene
test_completeInfo_db <- compare_sp(testing_metadata_cleaned, sra_metadata_db, antibiograms_db)
dim(test_completeInfo_db)
[1] 5458    6
Code
#[1] 5458    6

# ---- scientific_name_new -------
# Detect different genus between species using GTDB dataset
test_completeinfo_cleaned_db <- compare_genus(testing_metadata_cleaned, old_col = "scientific_name_CAMDA", new_col = "new_species")
Extrayendo el genero de las especies
Determinando especies erroneas
Agregar clasificaciones
Code
table(test_completeinfo_cleaned_db$genus_match)

 TRUE FALSE    NA 
 3906     6   143 
Code
#  TRUE FALSE    NA 
#  3906     6   143 

# agregar nueva clasiifacion en los nombres de las especies
test_completeinfo_cleaned_db$scientific_name_new <- ifelse(
  # Si new_species es NA, mantiene scientific_name_CAMDA.
  is.na(test_completeinfo_cleaned_db$new_species),
  test_completeinfo_cleaned_db$scientific_name_CAMDA,
  # Si son iguales, conserva scientific_name_CAMDA.
  ifelse(
    test_completeinfo_cleaned_db$scientific_name_CAMDA == test_completeinfo_cleaned_db$new_species,
         test_completeinfo_cleaned_db$scientific_name_CAMDA,
         test_completeinfo_cleaned_db$new_species))

dim(test_completeinfo_cleaned_db)
[1] 4055   14
Code
# [1] 4055   14

Hay 6 SRA con problemas en la especie

Code
test_completeinfo_cleaned_db %>% 
  filter(genus_match == "FALSE")
# A tibble: 6 × 14
  genus         species     accession  phenotype antibiotic measurement_value
  <chr>         <chr>       <chr>      <chr>     <chr>      <chr>            
1 Klebsiella    pneumoniae  SRR5385772 ?         GEN        ?                
2 Neisseria     gonorrhoeae SRR3360755 ?         TET        ?                
3 Acinetobacter baumannii   SRR1180875 ?         CAZ        ?                
4 Acinetobacter baumannii   SRR1180930 ?         CAZ        ?                
5 Acinetobacter baumannii   SRR1187769 ?         CAZ        ?                
6 Campylobacter jejuni      ERR9766771 ?         TET        ?                
# ℹ 8 more variables: scientific_name_CAMDA <chr>, new_species <chr>,
#   genome <chr>, ani <dbl>, old_genus <chr>, new_genus <chr>,
#   genus_match <fct>, scientific_name_new <chr>

Unificar las columnas entre training y test

Code
# global information 
global_cols <- c("genus", "species", "scientific_name_new", "accession", "genome", "phenotype", "antibiotic", "measurement_value", "ani")

# test informacion
test_db_cleaned <- test_completeinfo_cleaned_db %>%
  select(any_of(global_cols))

Save data

Code
write_tsv(test_db_cleaned, file = here("rawdata/TrainAndTest_cleaned", "Test_allstatus_sp_reference_metadata.tsv.gz"), quote = "none")

2.5 Check training dataset - SRA duplicated

2.5.1 How many SRA IDs are duplicated within the training set?

Code
# How many unique that we have?
cat("Number of SRA ID unique (accessions)", length(unique(training_metadata$accession)), sep = " ") # 5458 SRA ID uniqued
Number of SRA ID unique (accessions) 5458
Code
# Numero de SRA IDs que se repiten varias veces?
training_metadata_duplicated  <- table(training_metadata$accession)
sum(training_metadata_duplicated > 1) # 502 SRA IDs
[1] 502
Code
# How many duplicated that we have?
training_duplicated_table <- training_metadata %>%
  group_by(accession) %>%
  filter(n() > 1) %>%   # filtra accession con más de una ocurrencia
  summarise(duplicados = n()) %>%
  arrange(desc(duplicados))

# Only IDs (502 SRA IDs)
training_SRAIDs_duplicated <- training_duplicated_table$accession

# Obtain duplicated names
training_duplicated_db <- training_metadata %>% 
  filter(accession %in% training_SRAIDs_duplicated)

2.6 STEP 3. Cleaning training dataset

Code
training_metadata_cleaned <- training_metadata_db %>% 
  # Unir con la informacion de gtdb
  left_join(select(train_gtdb_result,
                   new_species, accession, genome, ani), 
            by ="accession", relationship = "many-to-many") %>%
  # redondear ANI
  mutate(ani = round(ani, 3)) %>%
  # Eliminar filas duplicadas
  distinct()

#  5458 SRA IDs accessions by row
nrow(training_metadata_cleaned) #5729
[1] 5729
Code
# Numero de accesiones
length(unique(training_metadata_cleaned$accession))#5458
[1] 5458
Code
# Sin ani con NA
all(is.na(training_metadata_cleaned$ani))
[1] FALSE
Code
# Si hay filas con mas duplicados, accesiones con mas genomas

Checar IDs repetidos

Code
table_accession <- table(training_metadata_cleaned$accession)
repeated_accessions <- table_accession[table_accession > 1]
length(repeated_accessions)  # Cuántos accesiones se repiten, 220
[1] 220
Code
# Ver filas duplicadas completas
duplicated_rows <- training_metadata_cleaned %>%
  filter(accession %in% names(repeated_accessions)) %>%
  arrange(accession)

duplicated_rows %>%
  DT::datatable()

Check information

Code
length(unique(training_metadata_cleaned$accession))
[1] 5458
Code
#[1] 5458
length(unique(train_gtdb_result$accession)) #  
[1] 5458
Code
#[1] 5458

dim(train_gtdb_result) #5752
[1] 5752   19
Code
dim(training_metadata_cleaned) # 5729
[1] 5729   20

Limpieza de especies

Code
# POR MODIFICAR
# Ya la debo de cambiar porque GTDB es mejor en los resultados que contiene
training_completeInfo_db <- compare_sp(training_metadata_db, sra_metadata_db, antibiograms_db)
dim(training_completeInfo_db)
[1] 5458    6
Code
#[1] 5458    6

# ---- scientific_name_new -------
# Detect different genus between species using GTDB dataset
training_completeinfo_cleaned_db <- compare_genus(training_metadata_cleaned, old_col = "scientific_name_CAMDA", new_col = "new_species") %>%
  # Delete problems with species
   filter(genus_match != FALSE) 
Extrayendo el genero de las especies
Determinando especies erroneas
Agregar clasificaciones
Code
table(training_completeinfo_cleaned_db$genus_match)

 TRUE FALSE    NA 
 5237     0   483 
Code
#  TRUE FALSE    NA 
#  5497     9  1874 

# agregar nueva clasiifacion en los nombres de las especies
training_completeinfo_cleaned_db$scientific_name_new <- ifelse(
  # Si new_species es NA, mantiene scientific_name_CAMDA.
  is.na(training_completeinfo_cleaned_db$new_species),
  training_completeinfo_cleaned_db$scientific_name_CAMDA,
  # Si son iguales, conserva scientific_name_CAMDA.
  ifelse(
    training_completeinfo_cleaned_db$scientific_name_CAMDA == training_completeinfo_cleaned_db$new_species,
         training_completeinfo_cleaned_db$scientific_name_CAMDA,
         training_completeinfo_cleaned_db$new_species))

dim(training_completeinfo_cleaned_db)
[1] 5720   24
Code
# [1] 5720   12

# Global information
training_completeinfo_cleaned_db %>%  
  # Visualizar informacion en una tabla bonita
  DT::datatable()
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
Code
table(training_completeinfo_cleaned_db$scientific_name_new)

   Acinetobacter baumannii  Acinetobacter courvalinii 
                       564                          1 
Acinetobacter nosocomialis       Campylobacter jejuni 
                         3                        537 
          Escherichia coli        Klebsiella africana 
                       507                          1 
     Klebsiella pneumoniae Klebsiella quasipneumoniae 
                       767                         11 
      Klebsiella variicola      Neisseria gonorrhoeae 
                        12                        751 
    Pseudomonas aeruginosa        Salmonella enterica 
                       571                        715 
     Staphylococcus aureus   Streptococcus pneumoniae 
                       573                        707 

Unificar las columnas entre training y test

Code
columns_to_select <- c(global_cols, "old_genus", "new_genus")

# test informacion
training_db_cleaned <- training_completeinfo_cleaned_db %>%
  select(any_of(columns_to_select)) %>%
  # Si new_genus tiene NA, colocar la informacion de old_genus y si no, dejar igual
  mutate(new_genus = if_else(is.na(new_genus),old_genus, new_genus )) %>%
  select(-old_genus)

dim(training_db_cleaned)
[1] 5720   10

Save data

Code
write_tsv(training_completeinfo_cleaned_db, file = here("rawdata/TrainAndTest_cleaned", "Training_allstatus_sp_reference_metadata.tsv.gz"), quote = "none")

Aqui debo unir el archivo de nuevos MIC y fenotipos

Save data

Code
write_tsv(training_db_cleaned, file = here("rawdata/TrainAndTest_cleaned", "training_metadata_cleaned.tsv"), quote = "none")

2.7 About missing data

Code
# NO CORRER - POR EDITAR
training_completeinfo_cleaned_db %>%
  group_by(scientific_name_CAMDA) %>%
  summarise(
    total_samples = n(),
    n_missing_data = sum(status == "Missing_All"),
    pct_missing_data = (n_missing_data / total_samples) * 100) %>%
  arrange(desc(n_missing_data)) %>%
# Change NA by Zero
 mutate(across(everything(), ~replace_na(., 0)))
Important

7 out of 9 species have missing or incorrect data.

2.8 Verify information sources

The data from NCBI does not match the information available in the antibiogram records.

Code
training_verify_data <- training_db_cleaned %>%
  filter(status_reference == "Verify_sources")
Code
table(training_completeInfo_db$status_reference)

2.9 STEP 4: Test vs public data

Verify the total number of SRA IDs.

Code
# Count unique accessions per species in the training metadata
train_counts_cleaned <- training_db_cleaned %>%
  group_by(scientific_name_new) %>%
  summarise(trainClean_num_accessions = n_distinct(accession))

# Count unique accessions per species in the testing metadata
test_counts_cleaned <- test_db_cleaned %>%
  group_by(scientific_name_new) %>%
  summarise(testClean_num_accessions = n_distinct(accession))

# Combine the counts by species, keeping all species present in the training set
combined_counts_completed <- train_counts_cleaned %>%
  left_join(test_counts_cleaned, by = "scientific_name_new") 

# sum columns
df_total <- combined_counts_completed %>%
  summarise(across(where(is.numeric), ~sum(.x, na.rm = TRUE))) %>%
  mutate(rowname = "Total") %>%
  select(rowname, everything())

combined_counts_completed <- combined_counts_completed %>%
  mutate(rowname = rownames(.)) %>%
  select(rowname, everything()) %>%
  bind_rows(df_total)

# Display the combined counts table
combined_counts_completed %>% 
  select(-rowname) %>%
  DT::datatable()

Save

Code
save(antibiograms_db, test_db_cleaned, training_db_cleaned, sra_metadata_db, test_completeinfo_cleaned_db, training_completeinfo_cleaned_db,  file = here("rawdata/TrainAndTest_cleaned", "complete_metadata.RData"))

3 Diverse phenotypes

Code
# Visualize phenotype info for Acinetobacter baumannii
training_db_cleaned %>%
  group_by(scientific_name_new, phenotype) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(
    names_from = phenotype,
    values_from = count,
    values_fill = 0
  )
# A tibble: 14 × 4
   scientific_name_new        Intermediate Resistant Susceptible
   <chr>                             <int>     <int>       <int>
 1 Acinetobacter baumannii             146       248         170
 2 Acinetobacter courvalinii             0         0           1
 3 Acinetobacter nosocomialis            0         0           3
 4 Campylobacter jejuni                  0       326         211
 5 Escherichia coli                     19       148         340
 6 Klebsiella africana                   0         1           0
 7 Klebsiella pneumoniae               133       326         308
 8 Klebsiella quasipneumoniae            0         2           9
 9 Klebsiella variicola                  0         3           9
10 Neisseria gonorrhoeae               148       336         267
11 Pseudomonas aeruginosa               94       249         228
12 Salmonella enterica                  19       346         350
13 Staphylococcus aureus                46       222         305
14 Streptococcus pneumoniae             53       311         343

How Many SRA are repeated within each species but assigned to different phenotypes

Code
training_db_cleaned %>%
  group_by(scientific_name_new, accession) %>%
  summarise(phenotype_count = n_distinct(phenotype), .groups = "drop") %>%
  filter(phenotype_count > 1) %>%
  count(scientific_name_new, name = "num_conflicting_accessions")
# A tibble: 6 × 2
  scientific_name_new      num_conflicting_accessions
  <chr>                                         <int>
1 Acinetobacter baumannii                          32
2 Escherichia coli                                  1
3 Neisseria gonorrhoeae                             5
4 Salmonella enterica                               1
5 Staphylococcus aureus                             3
6 Streptococcus pneumoniae                          6

To break down the count of conflicting accessions per species and per phenotype, you can extend the logic by grouping further by phenotype after identifying accessions with multiple phenotypes.

Code
training_diff_phenotype_summary <- training_db_cleaned %>%
  group_by(scientific_name_new, accession) %>%
  filter(n_distinct(phenotype) > 1) %>%
  distinct(scientific_name_new, accession, phenotype) %>%
  count(scientific_name_new, phenotype, name = "num_conflicting_accessions_per_phenotype")

head(training_diff_phenotype_summary)
# A tibble: 6 × 4
# Groups:   scientific_name_new, accession [3]
  scientific_name_new     accession  phenotype    num_conflicting_accessions_p…¹
  <chr>                   <chr>      <chr>                                 <int>
1 Acinetobacter baumannii SRR2916633 Intermediate                              1
2 Acinetobacter baumannii SRR2916633 Resistant                                 1
3 Acinetobacter baumannii SRR3131162 Intermediate                              1
4 Acinetobacter baumannii SRR3131162 Resistant                                 1
5 Acinetobacter baumannii SRR3131165 Intermediate                              1
6 Acinetobacter baumannii SRR3131165 Resistant                                 1
# ℹ abbreviated name: ¹​num_conflicting_accessions_per_phenotype
Code
# Identify species with more than one assigned phenotype
training_diff_phenotype_db <- training_db_cleaned %>%
  filter(accession %in% training_diff_phenotype_summary$accession)
dim(training_diff_phenotype_db)# [1] 108 21
[1] 108  10

Save files

Code
write_tsv(training_diff_phenotype_db, file = here("rawdata/diff_phenotype","training_diff_phenotype_db.tsv"))
write_tsv(training_diff_phenotype_summary, file = here("rawdata/diff_phenotype", "training_diff_phenotype_summary.tsv"))