Code
library(tidyverse)
library(DT) # Tablas bonitas
library(here)
Aim: Find how many species are correctly annotated in both databases CAMDA and NCBI related with CAMDA information (training and test datasets).
- Author: Evelia Coss
- Date: 1/July/2025
- Input files:
- test set: CAMDA test set (
test_metadata_db
).- training set: CAMDA training set (
training_metadata_db
).- NCBI data: Metadata from NCBI (June 10, 2025) (
sra_metadata_db
)- Antibiograms: Metadata from NCBI, ENA and BV-BRC. (February 16, 2025) (
antibiograms_db
)- Output files:
testing_metadata_cleaned.tsv
: File containing only the SRA data not shared with the training set. Located inrawdata/TrainAndTest_cleaned
.training_metadata_cleaned.tsv
: Clean dataset with no missing data and no misclassified species. Species were also reassigned when necessary. See full explanation in STEP 5.Training_allstatus_sp_reference_metadata.tsv.gz
: Metadata table describing the content of public datasets (sra_metadata_db
andantibiograms_db
) and the training dataset. Includes two additional columns:status
andstatus_reference
. See complete details in STEP 4.Test_allstatus_sp_reference_metadata.tsv.gz
: Metadata table describing the content of public datasets (sra_metadata_db
andantibiograms_db
) and the testing dataset. Includes two additional columns:status
andstatus_reference
. See complete details in STEP 4.complete_metadata.RData
: This RData file contains a consolidated set of key objects used in workflows for antimicrobial resistance analysis and sequencing metadata. It supports smooth transitions between preprocessing, integration, and analysis stages. See complete details in STEP 5.
antibiograms_db
: Antimicrobial susceptibility data for individual isolates, including phenotypic profiles and MIC values.test_db_cleaned
: Cleaned and preprocessed test dataset, prepared for downstream validation or analysis.training_db_cleaned
: Curated training dataset used for model development or exploratory analysis.sra_metadata_db
: Detailed metadata from the Sequence Read Archive (SRA), including project identifiers, sample information, and sequencing attributes.test_completeInfo_db
: Enriched version of the test dataset, integrating additional metadata such as phenotype, species, sample origin, and resource status information. Dataset related toTest_allstatus_sp_reference_metadata.tsv.gz.
training_cleanedInfo_db
: Harmonized training dataset with integrated metadata, including phenotype, species, and resource status details, ready for modeling or comparative analysis. Dataset related toTraining_allstatus_sp_reference_metadata.tsv.gz.
training_diff_phenotype_db.tsv
: : Data on accessions that have more than one phenotype associated with the MIC value. See complete details in Section 3: Diverse phenotypes.training_diff_phenotype_summary.tsv
: Summary of the number of phenotypes assigned per species and accession. See complete details in Section 3: Diverse phenotypes.- Function:
compare_sp()
: Performs a comparison and integration of species-level metadata across different sources (e.g.,training_metadata_db
,sra_metadata_db
, andantibiograms_db
). It ensures consistency and alignment of sample annotations by merging and validating key fields such as species identity, phenotypic labels, and associated metadata. The output is a harmonizedtibble
suitable for downstream analyses or quality control. See complete details in STEP 4.compare_genus()
: Check species name matches between CAMDA and public data across genus.
library(tidyverse)
library(DT) # Tablas bonitas
library(here)
# > training dataset
<- read_csv(here("rawdata/TrainAndTest_dataset", "training_dataset.csv")) training_metadata
Rows: 6144 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): genus, species, accession, phenotype, antibiotic, measurement_sign...
dbl (1): measurement_value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# rename species
<- training_metadata %>%
training_metadata_db mutate(scientific_name_CAMDA = paste(genus, species, sep = " "))
# Rows: 6144 Columns: 16── Column
#NOTA: This file contain NA in measurement_value
# > test dataset
<- read_csv(here("rawdata/TrainAndTest_dataset", "testing_template.csv")) test_metadata
Rows: 5345 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): genus, species, accession, phenotype, antibiotic, measurement_value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# rename species
<- test_metadata %>%
test_metadata_db mutate(scientific_name_CAMDA = paste(genus, species, sep = " ")) %>%
mutate(antibiotic = if_else(antibiotic == "tetracycline", "TET", antibiotic))
# Rows: 5345 Columns: 6── Column
unique(test_metadata_db$antibiotic)
[1] "GEN" "TET" "ERY" "CAZ"
# [1] "GEN" "TET" "ERY" "CAZ" "tetracycline"
unique(test_metadata_db$antibiotic)
[1] "GEN" "TET" "ERY" "CAZ"
# [1] "GEN" "TET" "ERY" "CAZ"
# > Load public metadata from NCBI
<- read_csv(here("metadata", "sra-metadata.csv")) sra_metadata
Rows: 11510 Columns: 47
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (26): Run, AssemblyName, download_path, Experiment, LibraryName, Librar...
dbl (10): spots, bases, spots_with_mates, avgLength, size_MB, InsertSize, I...
lgl (9): g1k_pop_code, source, g1k_analysis_group, Sex, Disease, Affection...
dttm (2): ReleaseDate, LoadDate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
<- sra_metadata %>% janitor::clean_names() %>% # para limpiar y estandarizar los nombres de columnas de un data frame
sra_metadata_db select(run, library_strategy, library_selection, library_source, platform, model, bio_project, bio_sample, scientific_name) %>%
# renombrar columnas
rename(accession = run, scientific_name_complete = scientific_name)
# Rows: 11510 Columns: 47── Column specification
# Reduce name
# Quedarme con las primeras dos palabras
$scientific_name_NCBI <- word(sra_metadata_db$scientific_name_complete, 1, 2)
sra_metadata_db
# > Antibiogram
# Combined antibiogram dataset from NCBI, ENA, and BV-BRC
# from: https://zenodo.org/records/14876710
# Define URL and local destination
<- "https://zenodo.org/record/15809334/files/antibiograms.tsv.zip"
anti_url <- here("metadata", "antibiograms_v2.tsv.zip")
destfile # Download the compressed file
download.file(url = anti_url, destfile = destfile, mode = "wb")
# Descomprimir el .zip
unzip(destfile, exdir = "metadata")
# Comprimir en zip
<- here("metadata", "antibiograms.tsv")
original_file <- here("metadata","antibiograms_v2.tsv.gz")
gz_file ::gzip(original_file, destname = gz_file, overwrite = TRUE)
R.utils
# Read the .tsv.gz file into R
# Date: February 16, 2025
# antibiograms_metadata
<- readr::read_tsv(destfile) %>%
antibiograms_db # Standardizes column names: converts them to lowercase,
# replaces spaces and special characters with underscores
::clean_names() %>%
janitor# Renames the columns
rename(
accession = reads,
scientific_name_Antibiogram = species
)
Rows: 1404850 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (14): biosample, sra_biosample, species, antibiotic, phenotype, measurem...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# > Train dataset with GTDB
# Define URL and local destination
<- "https://raw.githubusercontent.com/ccm-bioinfo/CAMDA2025_AntibioticResistance/main/preprocessing/metadata/train.ani.csv"
gtdb_train_url
<- here("rawdata/gtdb_results", "train.ani.csv")
destfile
# Download the compressed file
download.file(url = gtdb_train_url, destfile = destfile, mode = "wb")
# Read the .csv file into R
<- readr::read_csv(destfile) train_gtdb_result
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 5752 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): old_species, new_species, accession, phenotype, antibiotic, measur...
dbl (3): ani, measurement_value, publication
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# [1] 5752 19
# Falta eliminar los generos diferentes
# Sin accession repetidos
# all(duplicated(train_gtdb_result$accession))
# > Test dataset with GTDB
# Define URL and local destination
<- "https://raw.githubusercontent.com/ccm-bioinfo/CAMDA2025_AntibioticResistance/main/preprocessing/metadata/test.ani.csv"
gtdb_test_url
<- here("rawdata/gtdb_results", "test.ani.csv")
destfile
# Download the compressed file
download.file(url = gtdb_test_url, destfile = destfile, mode = "wb")
# Read the .csv file into R
<- readr::read_csv(destfile) test_gtdb_result
Rows: 5345 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): old_species, new_species, accession, phenotype, antibiotic, measure...
dbl (1): ani
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# [1] 5345 9
Distribution of SRA IDs
# Count unique accessions per species in the training metadata
<- training_metadata_db %>%
train_counts group_by(scientific_name_CAMDA) %>%
summarise(train_num_accessions = n_distinct(accession))
# Count unique accessions per species in the testing metadata
<- test_metadata_db %>%
test_counts group_by(scientific_name_CAMDA) %>%
summarise(test_num_accessions = n_distinct(accession))
# Combine the counts by species, keeping all species present in the training set
<- train_counts %>%
combined_counts left_join(test_counts, by = "scientific_name_CAMDA") %>%
# Replace NA in testing counts with 0 for species not present in the testing data
mutate(test_num_accessions = ifelse(is.na(test_num_accessions), 0, test_num_accessions))
# Display the combined counts table
combined_counts
# A tibble: 9 × 3
scientific_name_CAMDA train_num_accessions test_num_accessions
<chr> <int> <int>
1 Acinetobacter baumannii 488 597
2 Campylobacter jejuni 537 410
3 Escherichia coli 512 510
4 Klebsiella pneumoniae 791 700
5 Neisseria gonorrhoeae 741 679
6 Pseudomonas aeruginosa 571 548
7 Salmonella enterica 640 700
8 Staphylococcus aureus 571 597
9 Streptococcus pneumoniae 607 604
These tables may contain duplicates and incorrect annotations in the species names.
source("functions/compare_sp_function.R")
The resulting dataset includes two status columns: status
and status_reference
:
status
indicates how the species annotation from the CAMDA dataset compares to external sources:
Matched_NCBI_Antibiogram
: CAMDA matches both NCBI and antibiogram data.
Matched_NCBI
: CAMDA matches only NCBI data.
Matched_Antibiogram
: CAMDA matches only the antibiogram data.
Missing_All
: No sufficient external data available for comparison.
status_reference
evaluates the consistency between the two external sources (NCBI and antibiogram):
Good_sources
: NCBI and antibiogram annotations agree with each other.Verify_sources
: NCBI and antibiogram annotations disagree and require further manual verification to ensure data reliability.When the species annotations provided by NCBI and the antibiogram metadata do not match, it raises concerns about the accuracy of the sample’s taxonomic identity. Such discrepancies may indicate mislabeling, contamination, or outdated records. Therefore, it is essential to verify the original sources to ensure the reliability of downstream analyses.
source(here("functions", "compare_genus_function.R"))
Find how many species have matching names between CAMDA and NCBI to identify correctly annotated entries.
Join information
<- rbind(
gtdb_results # train
%>%
train_gtdb_result select(old_species, new_species, accession, genome) ,
# test
%>%
test_gtdb_result select(old_species, new_species, accession, genome)
)
dim(gtdb_results) # 5752+5345
[1] 11097 4
The information for the SRA IDs in the test and training datasets was verified:
We detected that 1,290 SRA IDs are shared between both files.
<- test_metadata_db %>%
testing_metadata_cleaned # Remove the test IDs that are shared with the training set
filter(!(accession %in% training_metadata_db$accession)) %>% # 1,290
# Unir con la informacion de gtdb
left_join(select(test_gtdb_result,
by ="accession") %>%
new_species, accession, genome, ani), # redondear ANI
mutate(ani = round(ani, 3))
# 4055 SRA IDs accessions by row
nrow(testing_metadata_cleaned) #4055
[1] 4055
# Numero de accesiones
length(unique(testing_metadata_cleaned$accession))#4055
[1] 4055
# Sin accession duplicados
all(duplicated(testing_metadata_cleaned$accession))
[1] FALSE
# Una sola accesion por fila
# Sin ani con NA
all(is.na(testing_metadata_cleaned$ani))
[1] FALSE
Save file:
# Create new folder
#dir.create("../rawdata/TrainAndTest_cleaned")
write_tsv(testing_metadata_cleaned, file = here("rawdata/TrainAndTest_cleaned", "testing_metadata_cleaned.tsv"), quote = "none")
Check information
length(unique(testing_metadata_cleaned$accession))
[1] 4055
#[1] 4055
length(unique(test_gtdb_result$accession)) # 1,290 SRA IDs are shared between both files.
[1] 5345
#[1] 5345
# > Informacion de GTDB dataset
# Filtrar solo los test que no aparecen en training
<- test_gtdb_result %>%
test_gtdb_cleaned filter(accession %in% testing_metadata_cleaned$accession) %>%
# Redondear ANI
mutate(ani = round(ani, 3))
dim(test_gtdb_cleaned)
[1] 4055 9
# [1] 4055 9
Verificar especies y eliminar datos incorrectos
# POR MODIFICAR
# Ya la debo de cambiar porque GTDB es mejor en los resultados que contiene
<- compare_sp(testing_metadata_cleaned, sra_metadata_db, antibiograms_db)
test_completeInfo_db dim(test_completeInfo_db)
[1] 5458 6
#[1] 5458 6
# ---- scientific_name_new -------
# Detect different genus between species using GTDB dataset
<- compare_genus(testing_metadata_cleaned, old_col = "scientific_name_CAMDA", new_col = "new_species") test_completeinfo_cleaned_db
Extrayendo el genero de las especies
Determinando especies erroneas
Agregar clasificaciones
table(test_completeinfo_cleaned_db$genus_match)
TRUE FALSE NA
3906 6 143
# TRUE FALSE NA
# 3906 6 143
# agregar nueva clasiifacion en los nombres de las especies
$scientific_name_new <- ifelse(
test_completeinfo_cleaned_db# Si new_species es NA, mantiene scientific_name_CAMDA.
is.na(test_completeinfo_cleaned_db$new_species),
$scientific_name_CAMDA,
test_completeinfo_cleaned_db# Si son iguales, conserva scientific_name_CAMDA.
ifelse(
$scientific_name_CAMDA == test_completeinfo_cleaned_db$new_species,
test_completeinfo_cleaned_db$scientific_name_CAMDA,
test_completeinfo_cleaned_db$new_species))
test_completeinfo_cleaned_db
dim(test_completeinfo_cleaned_db)
[1] 4055 14
# [1] 4055 14
Hay 6 SRA con problemas en la especie
%>%
test_completeinfo_cleaned_db filter(genus_match == "FALSE")
# A tibble: 6 × 14
genus species accession phenotype antibiotic measurement_value
<chr> <chr> <chr> <chr> <chr> <chr>
1 Klebsiella pneumoniae SRR5385772 ? GEN ?
2 Neisseria gonorrhoeae SRR3360755 ? TET ?
3 Acinetobacter baumannii SRR1180875 ? CAZ ?
4 Acinetobacter baumannii SRR1180930 ? CAZ ?
5 Acinetobacter baumannii SRR1187769 ? CAZ ?
6 Campylobacter jejuni ERR9766771 ? TET ?
# ℹ 8 more variables: scientific_name_CAMDA <chr>, new_species <chr>,
# genome <chr>, ani <dbl>, old_genus <chr>, new_genus <chr>,
# genus_match <fct>, scientific_name_new <chr>
Unificar las columnas entre training y test
# global information
<- c("genus", "species", "scientific_name_new", "accession", "genome", "phenotype", "antibiotic", "measurement_value", "ani")
global_cols
# test informacion
<- test_completeinfo_cleaned_db %>%
test_db_cleaned select(any_of(global_cols))
Save data
write_tsv(test_db_cleaned, file = here("rawdata/TrainAndTest_cleaned", "Test_allstatus_sp_reference_metadata.tsv.gz"), quote = "none")
# How many unique that we have?
cat("Number of SRA ID unique (accessions)", length(unique(training_metadata$accession)), sep = " ") # 5458 SRA ID uniqued
Number of SRA ID unique (accessions) 5458
# Numero de SRA IDs que se repiten varias veces?
<- table(training_metadata$accession)
training_metadata_duplicated sum(training_metadata_duplicated > 1) # 502 SRA IDs
[1] 502
# How many duplicated that we have?
<- training_metadata %>%
training_duplicated_table group_by(accession) %>%
filter(n() > 1) %>% # filtra accession con más de una ocurrencia
summarise(duplicados = n()) %>%
arrange(desc(duplicados))
# Only IDs (502 SRA IDs)
<- training_duplicated_table$accession
training_SRAIDs_duplicated
# Obtain duplicated names
<- training_metadata %>%
training_duplicated_db filter(accession %in% training_SRAIDs_duplicated)
<- training_metadata_db %>%
training_metadata_cleaned # Unir con la informacion de gtdb
left_join(select(train_gtdb_result,
new_species, accession, genome, ani), by ="accession", relationship = "many-to-many") %>%
# redondear ANI
mutate(ani = round(ani, 3)) %>%
# Eliminar filas duplicadas
distinct()
# 5458 SRA IDs accessions by row
nrow(training_metadata_cleaned) #5729
[1] 5729
# Numero de accesiones
length(unique(training_metadata_cleaned$accession))#5458
[1] 5458
# Sin ani con NA
all(is.na(training_metadata_cleaned$ani))
[1] FALSE
# Si hay filas con mas duplicados, accesiones con mas genomas
Checar IDs repetidos
<- table(training_metadata_cleaned$accession)
table_accession <- table_accession[table_accession > 1]
repeated_accessions length(repeated_accessions) # Cuántos accesiones se repiten, 220
[1] 220
# Ver filas duplicadas completas
<- training_metadata_cleaned %>%
duplicated_rows filter(accession %in% names(repeated_accessions)) %>%
arrange(accession)
%>%
duplicated_rows ::datatable() DT
Check information
length(unique(training_metadata_cleaned$accession))
[1] 5458
#[1] 5458
length(unique(train_gtdb_result$accession)) #
[1] 5458
#[1] 5458
dim(train_gtdb_result) #5752
[1] 5752 19
dim(training_metadata_cleaned) # 5729
[1] 5729 20
Limpieza de especies
# POR MODIFICAR
# Ya la debo de cambiar porque GTDB es mejor en los resultados que contiene
<- compare_sp(training_metadata_db, sra_metadata_db, antibiograms_db)
training_completeInfo_db dim(training_completeInfo_db)
[1] 5458 6
#[1] 5458 6
# ---- scientific_name_new -------
# Detect different genus between species using GTDB dataset
<- compare_genus(training_metadata_cleaned, old_col = "scientific_name_CAMDA", new_col = "new_species") %>%
training_completeinfo_cleaned_db # Delete problems with species
filter(genus_match != FALSE)
Extrayendo el genero de las especies
Determinando especies erroneas
Agregar clasificaciones
table(training_completeinfo_cleaned_db$genus_match)
TRUE FALSE NA
5237 0 483
# TRUE FALSE NA
# 5497 9 1874
# agregar nueva clasiifacion en los nombres de las especies
$scientific_name_new <- ifelse(
training_completeinfo_cleaned_db# Si new_species es NA, mantiene scientific_name_CAMDA.
is.na(training_completeinfo_cleaned_db$new_species),
$scientific_name_CAMDA,
training_completeinfo_cleaned_db# Si son iguales, conserva scientific_name_CAMDA.
ifelse(
$scientific_name_CAMDA == training_completeinfo_cleaned_db$new_species,
training_completeinfo_cleaned_db$scientific_name_CAMDA,
training_completeinfo_cleaned_db$new_species))
training_completeinfo_cleaned_db
dim(training_completeinfo_cleaned_db)
[1] 5720 24
# [1] 5720 12
# Global information
%>%
training_completeinfo_cleaned_db # Visualizar informacion en una tabla bonita
::datatable() DT
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
table(training_completeinfo_cleaned_db$scientific_name_new)
Acinetobacter baumannii Acinetobacter courvalinii
564 1
Acinetobacter nosocomialis Campylobacter jejuni
3 537
Escherichia coli Klebsiella africana
507 1
Klebsiella pneumoniae Klebsiella quasipneumoniae
767 11
Klebsiella variicola Neisseria gonorrhoeae
12 751
Pseudomonas aeruginosa Salmonella enterica
571 715
Staphylococcus aureus Streptococcus pneumoniae
573 707
Unificar las columnas entre training y test
<- c(global_cols, "old_genus", "new_genus")
columns_to_select
# test informacion
<- training_completeinfo_cleaned_db %>%
training_db_cleaned select(any_of(columns_to_select)) %>%
# Si new_genus tiene NA, colocar la informacion de old_genus y si no, dejar igual
mutate(new_genus = if_else(is.na(new_genus),old_genus, new_genus )) %>%
select(-old_genus)
dim(training_db_cleaned)
[1] 5720 10
Save data
write_tsv(training_completeinfo_cleaned_db, file = here("rawdata/TrainAndTest_cleaned", "Training_allstatus_sp_reference_metadata.tsv.gz"), quote = "none")
Aqui debo unir el archivo de nuevos MIC y fenotipos
Save data
write_tsv(training_db_cleaned, file = here("rawdata/TrainAndTest_cleaned", "training_metadata_cleaned.tsv"), quote = "none")
# NO CORRER - POR EDITAR
%>%
training_completeinfo_cleaned_db group_by(scientific_name_CAMDA) %>%
summarise(
total_samples = n(),
n_missing_data = sum(status == "Missing_All"),
pct_missing_data = (n_missing_data / total_samples) * 100) %>%
arrange(desc(n_missing_data)) %>%
# Change NA by Zero
mutate(across(everything(), ~replace_na(., 0)))
7 out of 9 species have missing or incorrect data.
The data from NCBI does not match the information available in the antibiogram records.
<- training_db_cleaned %>%
training_verify_data filter(status_reference == "Verify_sources")
table(training_completeInfo_db$status_reference)
Verify the total number of SRA IDs.
# Count unique accessions per species in the training metadata
<- training_db_cleaned %>%
train_counts_cleaned group_by(scientific_name_new) %>%
summarise(trainClean_num_accessions = n_distinct(accession))
# Count unique accessions per species in the testing metadata
<- test_db_cleaned %>%
test_counts_cleaned group_by(scientific_name_new) %>%
summarise(testClean_num_accessions = n_distinct(accession))
# Combine the counts by species, keeping all species present in the training set
<- train_counts_cleaned %>%
combined_counts_completed left_join(test_counts_cleaned, by = "scientific_name_new")
# sum columns
<- combined_counts_completed %>%
df_total summarise(across(where(is.numeric), ~sum(.x, na.rm = TRUE))) %>%
mutate(rowname = "Total") %>%
select(rowname, everything())
<- combined_counts_completed %>%
combined_counts_completed mutate(rowname = rownames(.)) %>%
select(rowname, everything()) %>%
bind_rows(df_total)
# Display the combined counts table
%>%
combined_counts_completed select(-rowname) %>%
::datatable() DT
Save
save(antibiograms_db, test_db_cleaned, training_db_cleaned, sra_metadata_db, test_completeinfo_cleaned_db, training_completeinfo_cleaned_db, file = here("rawdata/TrainAndTest_cleaned", "complete_metadata.RData"))
# Visualize phenotype info for Acinetobacter baumannii
%>%
training_db_cleaned group_by(scientific_name_new, phenotype) %>%
summarise(count = n(), .groups = "drop") %>%
pivot_wider(
names_from = phenotype,
values_from = count,
values_fill = 0
)
# A tibble: 14 × 4
scientific_name_new Intermediate Resistant Susceptible
<chr> <int> <int> <int>
1 Acinetobacter baumannii 146 248 170
2 Acinetobacter courvalinii 0 0 1
3 Acinetobacter nosocomialis 0 0 3
4 Campylobacter jejuni 0 326 211
5 Escherichia coli 19 148 340
6 Klebsiella africana 0 1 0
7 Klebsiella pneumoniae 133 326 308
8 Klebsiella quasipneumoniae 0 2 9
9 Klebsiella variicola 0 3 9
10 Neisseria gonorrhoeae 148 336 267
11 Pseudomonas aeruginosa 94 249 228
12 Salmonella enterica 19 346 350
13 Staphylococcus aureus 46 222 305
14 Streptococcus pneumoniae 53 311 343
How Many SRA are repeated within each species but assigned to different phenotypes
%>%
training_db_cleaned group_by(scientific_name_new, accession) %>%
summarise(phenotype_count = n_distinct(phenotype), .groups = "drop") %>%
filter(phenotype_count > 1) %>%
count(scientific_name_new, name = "num_conflicting_accessions")
# A tibble: 6 × 2
scientific_name_new num_conflicting_accessions
<chr> <int>
1 Acinetobacter baumannii 32
2 Escherichia coli 1
3 Neisseria gonorrhoeae 5
4 Salmonella enterica 1
5 Staphylococcus aureus 3
6 Streptococcus pneumoniae 6
To break down the count of conflicting accessions per species and per phenotype, you can extend the logic by grouping further by phenotype after identifying accessions with multiple phenotypes.
<- training_db_cleaned %>%
training_diff_phenotype_summary group_by(scientific_name_new, accession) %>%
filter(n_distinct(phenotype) > 1) %>%
distinct(scientific_name_new, accession, phenotype) %>%
count(scientific_name_new, phenotype, name = "num_conflicting_accessions_per_phenotype")
head(training_diff_phenotype_summary)
# A tibble: 6 × 4
# Groups: scientific_name_new, accession [3]
scientific_name_new accession phenotype num_conflicting_accessions_p…¹
<chr> <chr> <chr> <int>
1 Acinetobacter baumannii SRR2916633 Intermediate 1
2 Acinetobacter baumannii SRR2916633 Resistant 1
3 Acinetobacter baumannii SRR3131162 Intermediate 1
4 Acinetobacter baumannii SRR3131162 Resistant 1
5 Acinetobacter baumannii SRR3131165 Intermediate 1
6 Acinetobacter baumannii SRR3131165 Resistant 1
# ℹ abbreviated name: ¹num_conflicting_accessions_per_phenotype
# Identify species with more than one assigned phenotype
<- training_db_cleaned %>%
training_diff_phenotype_db filter(accession %in% training_diff_phenotype_summary$accession)
dim(training_diff_phenotype_db)# [1] 108 21
[1] 108 10
Save files
write_tsv(training_diff_phenotype_db, file = here("rawdata/diff_phenotype","training_diff_phenotype_db.tsv"))
write_tsv(training_diff_phenotype_summary, file = here("rawdata/diff_phenotype", "training_diff_phenotype_summary.tsv"))