ASC25-Q4

25 min

Proposal for ASC25-Q4

RNA m5C Modification Site Detection and Performance Optimization Challenge

Introduction

Overview

RNA molecules undergo various chemical modifications that play key roles in regulating gene expression, post-transcriptional processes, and protein synthesis. Among these modifications, 5-methylcytosine (m5C) is particularly important, influencing a wide range of RNA species and impacting numerous biological functions. The primary challenge in m5C detection lies in balancing the need for accuracy and reliability with the goal of minimizing false positives and increasing processing speed. Our team has rigorously tested and validated the modified workflow, ensuring it reliably identifies high-confidence m5C candidate sites with minimal false-positive rates, while minimizing the overall processing time.

Machine Specifications

Hardware resources

Table 4.1 provides an overview of the hardware resources utilized in this study.

Table 4.1: Hardware Specifications

| Component | Specification |
| --- | --- |
| CPU | Intel® Xeon® Gold 6330 (112 cores, 2.00 GHz) |
| Memory | 504 GB |
| System Architecture | x86_64 |

Software Environment and Dependencies

Table 4.2 summarizes the operating system and core environment employed in this study.

Table 4.2: Core Software Components

| Software | Version/Details |
| --- | --- |
| Operating System | Ubuntu 20.04.6 LTS |
| Linux kernel | 5.4.0-148-generic |
| GCC | 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2) |
| Make | GNU Make 4.2.1 |
  • Conda-Based Virtual Environment

    Conda is utilized to set up the environment, ensuring reproducibility and ease of replication. All package dependencies are specified in environment.yml to maintain a consistent computational environment across different setups. You can easily recreate the environment by running the following command:

    conda env create -f environment.yml

    The specific versions of core packages utilized in this study are listed below to ensure complete reproducibility:

    python=3.11.8=hab00c5b_0_cpython
    samtools=1.21=h96c455f_1
    umicollapse=1.1.0=hdfd78af_0
    htslib=1.21=h566b1c6_1
  • Manually compiled components

    • hisat-3n

      # in the path: /home/milefer7/opt
      git clone https://github.com/DaehwanKimLab/hisat2.git hisat-3n
      cd hisat-3n
      git checkout -b hisat-3n origin/hisat-3n
      make

Configuration

According to the guidelines outlined in the Processing Software Guidance and Parameters section of the ASC Student Supercomputer Challenge 2025 Preliminary Round Announcement (hereafter referred to as the announcement), we identified several parameters that needed adjustment in config.yaml and Snakefile.

After implementing these modifications, the project executes successfully.

config.yaml

The config.yaml file has been modified as follows.

# tools and scripts
path:
  samtools: /home/milefer7/miniconda3/envs/asc04/bin/samtools
  hisat3n: /home/milefer7/opt/hisat-3n/hisat-3n
  hisat3ntable: /home/milefer7/opt/hisat-3n/hisat-3n-table
  umicollapse: /home/milefer7/miniconda3/envs/asc04/share/umicollapse-1.1.0-0/umicollapse.jar
  bgzip: /home/milefer7/miniconda3/envs/asc04/bin/bgzip
  join_pileup.py: ../bin/join_pileup.py
  group_pileup.py: ../bin/group_pileup.py
  select_sites.py: ../bin/select_sites.py
  filter_sites.py: ../bin/filter_sites.py

# global config
library: INLINE # According to Workflow Example

# prepare genes index
customized_genes:
  - ../data/ref/fastq/NSUN2.fa
  - ../data/ref/fastq/DNMT2.fa
makedup: false 

# reference genome and index
reference:
  contamination:
    fa: ../data/ref/fastq/rRNA.fa
    hisat3n: ../data/ref/index/rRNA
  genes:
    fa: ../data/ref/fastq/tRNA.fa
    hisat3n: ../data/ref/index/tRNA
  genome:
    fa: ../data/ref/fastq/GRCh38.p14.genome.fa
    hisat3n: ../data/ref/index/Homo_sapiens.GRCh38.genome

# samples
samples:
  SRR23538290: # Sample names are the SRA run accessions
    data: # Single-end data, so only R1 is provided
      - R1: ../data/samples/SRR23538290_1.fastq.gz
    group: DRUG # All three samples belong to the DRUG group
  SRR23538291:
    data:
      - R1: ../data/samples/SRR23538291_1.fastq.gz 
    group: DRUG
  SRR23538292:
    data:
      - R1: ../data/samples/SRR23538292_1.fastq.gz
    group: DRUG

In addition, our team encountered several configuration issues, which we highlight here for reference.

  1. GLIBCXX_3.4.29 Missing Issue. The solution is provided below. You can verify installation by running strings libstdc++.so.6 | grep GLIBCXX_3.4*.

    sudo apt install software-properties-common
    sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
    sudo apt install -y g++-11
  2. Missing -e Option in Samtools. This occurs because the installed version is lower than 1.17. To resolve this issue, upgrade Samtools to version 1.17 or higher (see the sketch after this list).

  3. CBC (Coin-or Branch and Cut) Issue

    pulp.apis.core.PulpSolverError: Pulp: Error while trying to execute, use msg=True for more detailscbc
    libCbcSolver.so.3: cannot open shared object file

    The issue arises because the system cannot locate the CBC (Coin-or Branch and Cut) solver. To verify this, run cbc -v. The problem is often caused by multiple CBC versions installed via Conda. To resolve it, remove the extra versions and retain only coin-or-cbc, as sketched below.
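The following sketch covers the fixes for issues 2 and 3. It assumes a conda-based setup; the environment name asc04 matches the tool paths above, and the duplicate CBC package name is hypothetical (adjust it to whatever conda list reports):

    # Issue 2: upgrade samtools to >= 1.17 so the -e filter expression is available
    conda install -n asc04 -c bioconda "samtools>=1.17"

    # Issue 3: inspect the installed CBC builds, then keep only coin-or-cbc
    cbc -v                          # check which solver binary is on PATH
    conda list | grep -i cbc        # list installed CBC packages
    conda remove -n asc04 coincbc   # hypothetical duplicate package name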

Snakefile

  1. In the cutadapt_SE rule, an error occurs due to library: INLINE, resulting in the following exception:

    TypeError: 'NoneType' object is not iterable

    The solution is to remove the output entry report="report_reads/trimming/{sample}_{rn}.json", from the rule, along with the corresponding --json-file option in the shell command.

    This adjustment resolves the issue and ensures proper execution of the rule.

  2. In the hisat2_3n_mapping_genome_SE rule, an error occurs due to excessive memory usage.

    Killed (ERR): hisat2-align exited with value 137
    1. Unsuccessful Attempts
      1. Reduced concurrency for hisat2_3n_mapping_genes_SE from 24 to 12, then to 6.
      2. Reduced concurrency for hisat2_3n_mapping_genome_SE from 24 to 12, then to 6.
      3. Despite these adjustments, memory usage consistently exceeded the 504GB limit, leading to resource exhaustion and extremely slow processing.
    2. Issue Diagnosis
      1. Large Input Data Volume: After previous filtering steps, the remaining reads (which were not mapped to the contamination database) may still contain a significant number of complex small RNAs (e.g. tRNA), requiring precise alignment.
      2. High Index Complexity: The high sequence similarity within the reference genome results in an exponentially larger search space during alignment, significantly increasing memory usage.
    3. Proposed Solution
      1. Optimize Shell Parameters in rule hisat2_3n_mapping_genome_SE.
        1. Through analysis, we identified that the --all parameter is the primary cause of excessive memory usage. This option forces HISAT to report all possible alignments for each read, significantly increasing memory consumption and computational time.
        2. By eliminating --all, the workflow now executes smoothly without resource exhaustion.
  3. In the prepare_genes_index rule, the shell command should explicitly specify the path to hisat-3n-build, as follows:

    shell:
        """
        cat {input} >{output.fa}
        rm -f `dirname {output.index}`/`basename {output.index} ".CT.1.ht2"`.*.ht2
        /home/milefer7/opt/hisat-3n/hisat-3n-build -p 12 --base-change C,T {output.fa} {params.index}
        """

Methods

Workflow Design

For clarity, we use one set of samples to illustrate the workflow in the flow chart (Figure 4.1).

Figure 4.1: Workflow DAG with customized genes

Compared with the original Snakefile flow chart, we did not make many changes to the process; we only added one job, prepare_genes_index.

Dataset Acquisition

Sample Preparation
  1. The raw sequencing data used in this study were obtained from the NCBI Sequence Read Archive (SRA). The samples retrieved are listed in Table 4.3.

    Table 4.3: Sample Accession Numbers

    | GEO sample accession | SRA experiment accession |
    | --- | --- |
    | GSM7051146 | SRX19430740 |
    | GSM7051147 | SRX19430741 |
    | GSM7051148 | SRX19430742 |
  2. The data could be downloaded using the SRA Toolkit, specifically the prefetch command:

    prefetch SRX19430740 SRX19430741 SRX19430742

  3. After downloading, the .sra files were converted into FASTQ format using the fasterq-dump utility with the --split-files option. The corresponding SRA run accession numbers were SRR23538290, SRR23538291, and SRR23538292.

    fasterq-dump --split-files SRR23538290
    fasterq-dump --split-files SRR23538291
    fasterq-dump --split-files SRR23538292
  4. Compress the FASTQ files into gzipped (.gz) format

    fastq-dump --split-files --gzip SRR23538290
    fastq-dump --split-files --gzip SRR23538291
    fastq-dump --split-files --gzip SRR23538292
Reference Preparation

According to the ASC Student Supercomputer Challenge 2025 Preliminary Round Announcement (hereafter referred to as the announcement), two reference sequences are required:

  1. Homo_sapiens.GRCh38.dna.primary_assembly.fa
  2. Homo_sapiens.GRCh38.ncrna.fa

To prepare the reference, we need to generate an index using hisat-3n-build.

Filtering rRNA and tRNA from the ncRNA Reference and Building the Index

As outlined in the Guidelines for the Procedures section of the announcement, the input reference sequences include rRNA, tRNA, and genome reference sequences. Before constructing the index for Homo_sapiens.GRCh38.ncrna.fa, it is essential to first filter out specific RNA types.

The Homo_sapiens.GRCh38.ncrna.fa file contains various RNA categories, including Mt_rRNA, lncRNA, scRNA, scaRNA, vault_RNA, misc_RNA, miRNA, snoRNA, snRNA, sRNA, Mt_tRNA, and rRNA, among others. However, for our analysis, only Mt_rRNA, Mt_tRNA, and rRNA are required.

Therefore, the first step in preprocessing is to extract these specific RNA sequences from the ncRNA reference, ensuring that only the relevant RNA types are retained for downstream analysis.

To achieve this, we have developed a script to filter and extract the relevant RNA types. The key function in this script is extract_rrna_trna, which isolates the required sequences for downstream processing. Below, we provide the script details:

import sys

from Bio import SeqIO

def extract_rrna_trna(input_fasta, output_rrna, output_trna):
    rrna_records = []
    trna_records = []

    for record in SeqIO.parse(input_fasta, "fasta"):
        description = record.description.lower()
        # rRNA: nuclear (rRNA) and mitochondrial (Mt_rRNA) biotypes
        if ("gene_biotype:rrna" in description
                or "transcript_biotype:rrna" in description
                or "gene_biotype:mt_rrna" in description
                or "transcript_biotype:mt_rrna" in description):
            rrna_records.append(record)
        # tRNA: mitochondrial (Mt_tRNA) biotypes
        elif "gene_biotype:mt_trna" in description or "transcript_biotype:mt_trna" in description:
            trna_records.append(record)

    SeqIO.write(rrna_records, output_rrna, "fasta")
    SeqIO.write(trna_records, output_trna, "fasta")

if __name__ == "__main__":
    # Usage: python extract_rrna_trna.py <ncrna.fa> <rRNA.fa> <tRNA.fa>
    extract_rrna_trna(sys.argv[1], sys.argv[2], sys.argv[3])
  • Run the Python script
    • python extract_rrna_trna.py Homo_sapiens.GRCh38.ncrna.fa rRNA.fa tRNA.fa
      • Input: Homo_sapiens.GRCh38.ncrna.fa
      • Output: rRNA.fa and tRNA.fa

Lastly, we utilized hisat-3n-build to generate the index.
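As a concrete sketch, mirroring the hisat-3n-build invocation shown in the prepare_genes_index rule above (index prefixes follow config.yaml; the thread count and working directory are our assumptions):

# Build HISAT-3N indexes for the filtered rRNA/tRNA references
cd data/ref
/home/milefer7/opt/hisat-3n/hisat-3n-build -p 12 --base-change C,T fastq/rRNA.fa index/rRNA
/home/milefer7/opt/hisat-3n/hisat-3n-build -p 12 --base-change C,T fastq/tRNA.fa index/tRNA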

Build Homo_sapiens.GRCh38.dna.primary_assembly.fa Index

Build Homo_sapiens.GRCh38.dna.primary_assembly.fa index directly.
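A sketch of the corresponding command, again mirroring the prepare_genes_index invocation (the index prefix follows config.yaml; the thread count is illustrative):

# Build the HISAT-3N index for the genome reference
/home/milefer7/opt/hisat-3n/hisat-3n-build -p 12 --base-change C,T fastq/GRCh38.p14.genome.fa index/Homo_sapiens.GRCh38.genome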

Customized_genes Preparation

NSUN2 [1] and DNMT2 [2] are the core “writers” of RNA m5C modification, and their functions have been confirmed in multiple studies to be directly related to m5C methylation. The specific reasons are as follows:

  • NSUN2:
    • Mainly catalyzes m5C methylation of tRNA and mRNA, affecting RNA stability, translation efficiency and stress resistance.
  • DNMT2:
    • Although initially mistaken for a DNA methyltransferase, it is actually a key enzyme for m5C methylation of tRNA, protecting tRNA from cleavage especially under stress conditions.
    • DNMT2 and NSUN2 are complementary in tRNA methylation, and double knockout causes embryonic lethality in mice.

The two files have been downloaded and stored in the data/ref/fastq directory.

Dataset Acquisition Conclusion

The data file structure is as follows.

data/
├── ref
│   ├── fastq
│   │   ├── DNMT2.fa
│   │   ├── GRCh38.p14.genome.fa
│   │   ├── Homo_sapiens.GRCh38.ncrna.fa
│   │   ├── NSUN2.fa
│   │   ├── rRNA.fa
│   │   └── tRNA.fa
│   └── index
│       ├── Homo_sapiens.GRCh38.genome.3n.*
│       ├── Homo_sapiens.GRCh38.ncrna.fa.3n.*
│       └── rRNA.3n.CT.*
└── samples
    ├── SRR23538290_1.fastq.gz
    ├── SRR23538291_1.fastq.gz
    └── SRR23538292_1.fastq.gz

All processing steps were performed in a Conda-managed environment, ensuring reproducibility.

Optimization Strategies

In this section, we discuss the optimization strategies implemented by our team to enhance performance and address existing issues.

Parallel Scheduling
Discussion and Analysis

The core idea is to allocate computational resources efficiently to ensure that all three samples can be processed in parallel. The system is equipped with 112 CPU cores, with 4 cores reserved for operating system processes, leaving 108 cores available for computation. By evenly distributing these resources, each sample is allocated 36 cores for processing.

Furthermore, different computational tasks exhibit varying characteristics: some are I/O-intensive, others are compute-intensive, and some represent a hybrid workload. The allocation strategy should account for these differences to optimize overall system performance and resource utilization.

Table 4.4 presents a comparison of thread allocations before and after the changes.

Table 4.4: Thread Allocation Comparison

| rule_name | job_count | threads |
| --- | --- | --- |
| combine_runs | 9 | 8 |
| cutadapt_SE | 3 | 20 -> 36 |
| dedup_mapping | 6 | 20 -> 3 |
| extract_unmap_bam_final_SE | 3 | 4 -> 1 |
| extract_unmap_bam_internal_SE | 9 | 4 |
| group_pileup | 2 | 6 |
| hisat2_3n_calling_filtered_multi | 6 | 16 -> 5 |
| hisat2_3n_calling_filtered_unqiue | 6 | 16 -> 5 |
| hisat2_3n_calling_unfiltered_multi | 6 | 16 -> 5 |
| hisat2_3n_calling_unfiltered_unique | 6 | 16 -> 5 |
| hisat2_3n_filtering | 6 | 4 |
| hisat2_3n_mapping_contamination_SE | 3 | 24 -> 10 |
| hisat2_3n_mapping_genes_SE | 3 | 24 -> 10 |
| hisat2_3n_mapping_genome_SE | 3 | 24 -> 28 |
| hisat2_3n_sort | 9 | 16 -> 4 |
| join_pileup | 6 | 6 |
| prepare_genes_index | 1 | 12 |
| stat_mapping_number | 3 | 4 |
| stat_sample_background | 6 | 2 |

The explanations are provided below.

  • cutadapt_SE
    • Since three samples are processed in parallel, the available 108 computing cores are evenly distributed among them: $\frac{108}{3} = 36$. Thus, each sample is allocated 36 threads to ensure efficient processing.
  • dedup_mapping
    • Practical tests showed that allocating too many threads actually increased runtime, so we ultimately allocated 3 threads.
  • extract_unmap_bam_final_SE
    • This step primarily involves a file-moving operation (mv), which is not computationally intensive. Consequently, this process requires only a single thread to execute efficiently.
  • hisat2_3n_calling_*
    • This step consists of 24 independent jobs. To maximize parallel execution while maintaining computational efficiency, the thread count is reduced to 5 per job, facilitating optimal resource utilization.
  • hisat2_3n_mapping_*
    • According to the workflow diagram, these three jobs can be executed simultaneously. The genome dataset has an approximate size of 3 GB, which is substantially larger than the contamination and gene datasets. Given this disparity in data volume, prioritizing higher thread allocation for genome mapping is necessary to enhance computational efficiency.
    • Based on empirical performance evaluations, the thread distribution is as follows:
      • hisat2_3n_mapping_genome_SE -> 28 threads
      • The remaining two tasks -> 10 threads each
  • hisat2_3n_sort
    • Following the execution of hisat2_3n_mapping_genome_SE (28 threads), the subsequent steps involve hisat2_3n_sort and extract_unmap_bam_internal_SE (4 threads).
    • Given that each sample includes three parallel jobs, the total computational workload per sample is $\frac{24}{3} = 8$.
    • Since this step is not purely computationally intensive, real-world performance assessments indicate that an allocation of 4 threads per job yields the most efficient execution.
Comparison of Optimization Outcomes

We conducted a comparative analysis of v0.1 (a Git branch without thread-allocation modifications) and v0.25 (with thread modifications) to examine the details (Table 4.5).

Table 4.5: Comparison of Parallel Scheduling Optimization Outcomes

| Rule Name | Total Time Before (min) | Total Time After (min) | Time Saved (min) | Improvement (%) |
| --- | --- | --- | --- | --- |
| combine_runs | 4.30 | 4.30 | 0.00 | 0.00% |
| cutadapt_SE | 11.10 | 9.97 | 1.13 | 10.18% |
| dedup_mapping | 83.10 | 83.47 | -0.37 | -0.45% |
| extract_unmap_bam_final_SE | 0.00 | 0.00 | 0.00 | 0.00% |
| extract_unmap_bam_internal_SE | 4.65 | 4.52 | 0.13 | 2.80% |
| group_pileup | 4.68 | 5.15 | -0.47 | -10.04% |
| hisat2_3n_calling_filtered_multi | 175.67 | 153.77 | 21.90 | 12.45% |
| hisat2_3n_calling_filtered_unqiue | 176.03 | 155.82 | 20.21 | 11.48% |
| hisat2_3n_calling_unfiltered_multi | 176.00 | 154.68 | 21.32 | 12.11% |
| hisat2_3n_calling_unfiltered_unique | 171.98 | 151.78 | 20.20 | 11.74% |
| hisat2_3n_filtering | 9.52 | 8.10 | 1.42 | 14.92% |
| hisat2_3n_mapping_contamination_SE | 24.48 | 18.88 | 5.60 | 22.88% |
| hisat2_3n_mapping_genes_SE | 24.02 | 17.70 | 6.32 | 26.32% |
| hisat2_3n_mapping_genome_SE | 31.22 | 29.02 | 2.20 | 7.05% |
| hisat2_3n_sort | 9.85 | 14.27 | -4.42 | -44.82% |
| join_pileup | 8.52 | 8.25 | 0.27 | 3.17% |
| prepare_genes_index | 0.02 | 0.00 | 0.02 | 100.00% |
| stat_mapping_number | 0.65 | 0.55 | 0.10 | 15.38% |
| stat_sample_background | 1.63 | 1.42 | 0.21 | 12.88% |

A slight time deviation of several minutes per run is considered normal. Specifically, for hisat2_3n_sort, the thread allocation was reduced from 16 to 4, resulting in an average slowdown of about 30 seconds per job. However, this reallocation provided greater overall efficiency for other rules. This trade-off highlights the importance of balancing individual task performance with global optimization to maximize throughput in multi-threaded pipelines.

Compiler Optimization
Discussion and Analysis

After optimizing thread parallelism, another approach under discussion is compiler optimization. Previously, we used the default configuration (conda install) without enabling the -O3 optimization flag. Since the most time-consuming processes are hisat2_3n_calling_* and hisat2_3n_mapping_*, optimizing HISAT-3N is the top priority. Additionally, enabling -O3 optimization for SAMtools and bgzip could further enhance performance.

To achieve this, we modified the Makefile, adjusting the parameters of RELEASE_FLAGS and LDFLAGS as follows:

RELEASE_FLAGS = -O3 -march=native -funroll-loops -fprefetch-loop-arrays \
                -flto -fomit-frame-pointer -floop-interchange -floop-unroll-and-jam \
                -fpeel-loops -funswitch-loops -floop-block
LDFLAGS = -static -flto

Justification for these optimizations:

Table 4.6: Compiler Optimization Flags and Their Effects

| Flag | Description |
| --- | --- |
| -O3 | Enables aggressive optimizations, including loop unrolling and vectorization. |
| -march=native | Generates code optimized for the target CPU, utilizing processor-specific instructions. |
| -funroll-loops | Unrolls loops to reduce branching and improve execution efficiency. |
| -fprefetch-loop-arrays | Prefetches loop data to reduce memory latency. |
| -flto | Enables Link-Time Optimization (LTO) for cross-file optimizations and reduced binary size. |
| -fomit-frame-pointer | Omits frame pointers to free up registers and enhance performance. |
| -floop-interchange | Reorders nested loops to improve cache utilization. |
| -floop-unroll-and-jam | Unrolls and fuses loops for better parallel execution. |
| -fpeel-loops | Extracts iterations from loops to improve scheduling. |
| -funswitch-loops | Moves invariant conditions outside loops to reduce branch mispredictions. |
| -floop-block | Blocks loops to improve memory locality and performance. |
| -static (LDFLAGS) | Produces a statically linked binary, reducing runtime dependencies. |

By integrating these compiler optimizations, we aim to significantly enhance the efficiency of HISAT-3N, SAMtools, and bgzip, leading to faster execution times and better resource utilization.

Comparison of Optimization Outcomes

We conducted a comparative analysis of v0.25 (a Git branch with thread modifications) and v0.5 (which combines the thread modifications with -O3-optimized builds of HISAT-3N, SAMtools, and bgzip) to examine their differences in detail (Table 4.7).

Table 4.7: Comparison of Compiler Optimization Outcomes

| Rule Name | Total Time Before (min) | Total Time After (min) | Time Saved (min) | Improvement (%) |
| --- | --- | --- | --- | --- |
| combine_runs | 4.30 | 8.45 | -4.15 | -96.51% |
| cutadapt_SE | 9.97 | 9.93 | 0.04 | 0.40% |
| dedup_mapping | 83.47 | 82.58 | 0.89 | 1.07% |
| extract_unmap_bam_internal_SE | 4.52 | 6.17 | -1.65 | -36.50% |
| group_pileup | 5.15 | 5.33 | -0.18 | -3.50% |
| hisat2_3n_calling_filtered_multi | 153.77 | 155.83 | -2.06 | -1.34% |
| hisat2_3n_calling_filtered_unqiue | 155.82 | 153.45 | 2.37 | 1.52% |
| hisat2_3n_calling_unfiltered_multi | 154.68 | 156.65 | -1.97 | -1.27% |
| hisat2_3n_calling_unfiltered_unique | 151.78 | 153.68 | -1.90 | -1.25% |
| hisat2_3n_filtering | 8.10 | 10.35 | -2.25 | -27.78% |
| hisat2_3n_mapping_contamination_SE | 18.88 | 24.07 | -5.19 | -27.48% |
| hisat2_3n_mapping_genes_SE | 17.70 | 24.23 | -6.53 | -36.91% |
| hisat2_3n_mapping_genome_SE | 29.02 | 29.70 | -0.68 | -2.34% |
| hisat2_3n_sort | 14.27 | 19.23 | -4.96 | -34.76% |
| join_pileup | 8.25 | 8.37 | -0.12 | -1.45% |
| prepare_genes_index | 0.00 | 0.02 | -0.02 | / |
| stat_mapping_number | 0.55 | 1.55 | -1.00 | -181.82% |
| stat_sample_background | 1.42 | 1.48 | -0.06 | -4.23% |

In our initial release, we relied on conda's pre-built versions of tools such as SAMtools and bgzip. In an effort to boost performance, our latest version (v0.5) transitioned to manually building these tools, including HISAT-3N, with aggressive optimization flags.

Despite expectations of enhanced performance, our benchmarking revealed that several workflow rules experienced negative optimization, with longer runtimes compared to the previous version (v0.25). During our team discussions, several potential causes for these counterintuitive results were identified:

  1. Increased Binary Complexity and Cache Pressure. The aggressive optimization flags, while intended to streamline computations, may have inadvertently increased binary size or complexity. This can lead to less efficient cache utilization on certain hardware architectures, negating the benefits of higher-level optimizations.
  2. Overhead from Link-Time Optimization and Static Linking. While link-time optimization (-flto) and static linking (-static) are designed to reduce runtime overhead, they can also introduce complications. These include increased memory footprint and longer load times, which may counterbalance the improvements in execution speed during runtime.
  3. Incompatibility with Hardware-Specific Characteristics. The conda-distributed versions are often compiled with a balanced set of flags that ensure robust performance across a wide range of systems. In contrast, our manual optimizations (e.g., -march=native) are tailored for specific hardware, which may lead to suboptimal performance on systems where these assumptions do not hold or where the aggressive optimizations cause detrimental instruction scheduling.

In summary, while transitioning from conda-managed packages to manually optimized builds appeared promising in theory, our results indicate that this approach has led to performance regressions in several workflow components. Moving forward, we plan to conduct targeted profiling and benchmarking to refine our compiler flags and build configurations. This iterative process will help us achieve tangible performance improvements while maintaining system stability and cross-platform compatibility.

Fine-tuning of Bioinformatics Tool Parameters
  1. In the rule cutadapt_SE:

    1. To determine whether the data is single-end or paired-end, we use the following command: fastq-dump -X 1 --split-spot -Z *.sra | wc -l
      1. A return value of 4 indicates single-end reads.
      2. A return value of 8 indicates paired-end reads.
      3. In our case, all three datasets returned 4, which confirms that the data is single-end.

    Since the data is single-end, the --auto-rc parameter is unnecessary. This parameter in Cutadapt automatically selects the reverse complement of the adapter sequence if it results in a better match. However, since our data does not contain complementary strands, this functionality is redundant. Therefore, in the cutadapt_SE rule, our team decided to remove --auto-rc from the shell command to simplify processing and avoid unnecessary computations.

  2. In the rule hisat2_3n_sort:

    In the shell command, the parameter -m 3G was originally set for memory allocation. However, after testing, we found that allocating 3 GB per thread was insufficient. With 18 threads, the process consumed a significant amount of memory, reaching RSIZE = 54.6 GB (resident memory size).

    Figure 4.2: Resource Utilization of hisat2_3n_sort_3GB


    Given this observation, we hypothesized that 3GB per thread might not be sufficient and that increasing the allocation could improve efficiency. To test this, we adjusted the memory allocation to 10GB (-m 10G) and analyzed the resource utilization again. The results are shown in Figure 4.3.

    Figure 4.3: Resource Utilization of hisat2_3n_sort_10G


    Upon further analysis, we noticed that VSIZE (virtual memory size) was significantly larger than RSIZE, indicating that not all allocated memory was actively used by each thread. Based on this finding, we estimated that each thread effectively utilizes slightly more than 3GB at peak usage.

    Therefore, to strike a balance between efficiency and resource allocation, we decided to set the memory limit to 4GB per thread (-m 4G), ensuring optimal performance without excessive overhead.
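    As a sketch, assuming the rule wraps samtools sort (the placeholders follow Snakefile conventions; this is not the rule's verbatim command):

    # Adjusted sort call: 4 GB per sort thread instead of 3 GB
    samtools sort -@ {threads} -m 4G -O bam -o {output} {input}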

  3. In the rule dedup_mapping:


    In the shell command, the memory allocation parameter -Xmx40G was modified to -Xmx10G. The -Xmx flag specifies the maximum heap size for the Java Virtual Machine (JVM), which determines the upper limit of memory that can be used by the process.

    To assess the impact of this change, we monitored the entire runtime of dedup_mapping and captured the maximum RSIZE (resident memory size) used during execution. Our observations showed that the peak memory utilization reached 8.6 GB (Figure 4.4), indicating that the initial allocation of 40 GB was excessive and left a large amount of unused memory.

    Figure 4.4: Resource Utilization of dedup_mapping


    Based on this analysis, we determined that allocating 10GB is a more appropriate and efficient choice, as it provides sufficient memory while avoiding unnecessary overhead. This adjustment helps optimize resource usage without compromising performance.
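    A sketch of the adjusted call (the bam subcommand and -i/-o options follow UMICollapse's documented command line; file names are placeholders):

    # Deduplicate with a 10 GB JVM heap instead of 40 GB
    java -Xmx10G -jar umicollapse.jar bam -i {input} -o {output}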

  4. In the rule hisat2_3n_calling_*:

    As analyzed earlier, a single sample can utilize up to 36 CPU cores at maximum capacity. However, in pipeline workflows, thread usage can accumulate across multiple steps, leading to excessive resource consumption.

    In Unix pipelines (|), all steps execute in parallel, and each tool independently allocates its own threads. For example: samtools -@ 4 ... | hisat3ntable -p 8 ... | bgzip -@ 4 .... The total thread usage in this pipeline is: 4 (samtools) + 8 (hisat3ntable) + 4 (bgzip) = 16 threads.

    If each step were assigned -p 16, the total thread count would reach 48, far exceeding the available physical cores. This would result in resource contention, excessive context switching, and degraded performance rather than speed improvements.

    To prevent excessive resource usage and optimize CPU allocation, we refined the granularity of thread control as follows:

    samtools_threads: 1    # Reduce samtools threads (mainly I/O bound)
    hisat_threads: 2       # Prioritize core allocation for computational tasks
    bgzip_threads: 2       # Compression is often input-speed limited
    threads: 5             # Overall thread count control
    shell:
        """
        {BIN[samtools]} view -@ {params.samtools_threads} -e "rlen<100000" -h {input} | {BIN[hisat3ntable]} -p {params.hisat_threads} -u --alignments - --ref {params.fa} --output-name /dev/stdout --base-change C,T | cut -f 1,2,3,5,7 | {BIN[bgzip]} -@ {params.bgzip_threads} -c > {output}
        """

    This adjustment ensures efficient resource utilization while avoiding unnecessary performance bottlenecks caused by thread oversubscription.

Attempting to Optimize Python Scripts

To enhance the efficiency of our methylation analysis pipeline for large-scale sequencing datasets, we implemented several key optimizations in the filter_sites.py script.

  1. Lazy Evaluation for Memory Efficiency We adopted Polars’ lazy execution paradigm (scan_ipc) for processing methylation site data. Instead of using pl.read_ipc(), we utilize pl.scan_ipc().collect(), which significantly reduces memory consumption. This approach delays actual computation until collection time, allowing for more efficient memory management when handling large datasets.

    df_site = (
        pl.scan_ipc(args.input_file)
        .with_columns(
            u=pl.col("unconvertedBaseCount_filtered_uniq"),
            d=pl.col("convertedBaseCount_filtered_uniq") + pl.col("unconvertedBaseCount_filtered_uniq"),
        )
        .with_columns(ur=pl.col("u") / pl.col("d"))
        .collect()  # Trigger calculation
    )
  2. Vectorized Statistical Computation By utilizing pl.Series and list comprehension instead of map_elements, we process all elements in the entire Series simultaneously rather than applying the function row by row. While the internal implementation still relies on Python loops (via list comprehension and zip), it operates on entire data columns in batches. This approach is often referred to as a “vectorized” or “batched” operation, significantly improving efficiency compared to element-wise function calls.

    # before
    def testp(successes, trials, p):
        if successes == 0 or trials == 0:
            return 1.0
        return binomtest(successes, trials, p, alternative="greater").pvalue
        
    # after
    def calculate_pval(u: pl.Series, d: pl.Series, p: float) -> pl.Series:
        from scipy.stats import binomtest
        return pl.Series([
            1.0 if (u_val == 0 or d_val == 0)
            else binomtest(u_val, d_val, p, alternative="greater").pvalue
            for u_val, d_val in zip(u, d)
        ])    
  3. Efficient Null Handling We use .fill_null(0) explicitly instead of relying on implicit null handling in the original code. This ensures consistency, prevents potential errors, and improves performance by avoiding unnecessary element-wise checks.

    .with_columns(
        pl.col("u").fill_null(0),
        pl.col("d").fill_null(0)
    )
  4. Schema-Aware Data Loading Explicitly defining the schema for mask file parsing ensures consistent data types, reduces type inference overhead, and improves processing efficiency.

    schema_overrides={"ref": pl.Utf8, "pos": pl.Int64, "strand": pl.Utf8}

    By specifying column data types in advance, we minimize potential type mismatches and optimize data loading performance.

After analysis and discussion, we put these optimizations into practice in branch v0.5. The results are shown in Figure 4.5.

Figure 4.5: Runtime of filter_sites.py before and after optimization

Before optimization, the real time was 0m37.227s, while after optimization, it was reduced to 0m33.145s—a decrease of approximately 4 seconds. Real time represents the actual elapsed time from start to finish, indicating an overall improvement in execution efficiency.

The user time before optimization was 3m28.248s, increasing slightly to 3m28.993s after optimization, reflecting a minor 0.745-second increase. User time measures the CPU time spent in user space. This small increase suggests that while computational workload in user space may have grown slightly, the impact is negligible.

The sys time, which represents CPU time spent in kernel space, showed a significant improvement. Before optimization, it was 7m45.517s, which decreased to 7m1.135s, resulting in a reduction of approximately 44 seconds. This indicates that system-level operations, such as I/O handling and system calls, were notably optimized.

The optimization resulted in a reduction in both real and sys time, leading to faster execution and lower system overhead. Although user time experienced a minor increase, the overall efficiency gain outweighs this, demonstrating a successful optimization.

Results

Evaluation of Detection Accuracy and Correlation

Since the participants do not have access to the actual true.tsv file, we use the benchmark test results as a reference for evaluation. Given that we have not made substantial parameter modifications in the Snakefile—except for minor adjustments to ensure successful execution, as detailed in Methods/Snakefile—the benchmark serves as a reliable substitute for the ground truth.

To systematically assess detection accuracy and correlation, we developed a script (scripts/metrics.sh). This script allows us to verify that any parameter adjustments we make in subsequent optimization steps do not compromise the accuracy and reliability of the detection process.
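A minimal sketch of the kind of comparison such a script performs is shown below; the actual scripts/metrics.sh may differ. It assumes two TSV files whose first three columns identify a site (ref, pos, strand):

#!/usr/bin/env bash
# Compare detected sites against the benchmark reference run
detected=$1
benchmark=$2

cut -f1-3 "$detected"  | sort -u > det.keys
cut -f1-3 "$benchmark" | sort -u > ref.keys

overlap=$(comm -12 det.keys ref.keys | wc -l)
det_total=$(wc -l < det.keys)
ref_total=$(wc -l < ref.keys)

# precision: fraction of detected sites confirmed by the benchmark
# recall:    fraction of benchmark sites recovered by this run
awk -v o="$overlap" -v d="$det_total" -v r="$ref_total" \
    'BEGIN { printf "precision=%.4f  recall=%.4f\n", o/d, o/r }'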

Figure 4.6: Evaluation of detection accuracy and correlation

Performance Optimization and Computational Efficiency

To evaluate the impact of our optimizations, we compared the total runtime before and after the modifications. The initial execution time was 03:17:01, while the optimized workflow completed in 02:20:46, indicating a significant performance improvement.

  1. Performance Gain Calculation.

    1. To quantify the speedup, we calculate the time reduction percentage using the formula:

      $$\text{Improvement (\%)} = \left( \frac{\text{Time Before} - \text{Time After}}{\text{Time Before}} \right) \times 100$$
    2. Applying this to our case:

      $$\text{Improvement (\%)} = \left( \frac{(3 \times 3600 + 17 \times 60 + 1) - (2 \times 3600 + 20 \times 60 + 46)}{3 \times 3600 + 17 \times 60 + 1} \right) \times 100 = 28.55\%$$

    The resulting percentage reflects the overall efficiency gain achieved through our optimizations.
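    The arithmetic can be double-checked with a one-off shell snippet (times converted to seconds):

    before=$((3*3600 + 17*60 + 1))    # 03:17:01 -> 11821 s
    after=$((2*3600 + 20*60 + 46))    # 02:20:46 -> 8446 s
    awk -v b="$before" -v a="$after" 'BEGIN { printf "%.2f%%\n", (b - a) / b * 100 }'
    # prints 28.55%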

  2. Breakdown of Optimization Effects. To further understand the sources of improvement, we categorized our optimizations into the following key factors:

    1. Parallelization and Thread Allocation: Fine-tuned threading strategies reduced contention and improved CPU utilization.
    2. Memory Management: Adjusted memory allocation parameters to prevent excessive memory usage while maintaining efficiency.
    3. Pipeline Bottleneck Reduction: Eliminated redundant computations and optimized intermediate steps.

Discussion

Potential Strategies for Further Optimization of Snakefile Execution Time

To further optimize the execution time of the Snakefile, atop was used to monitor resource utilization. The corresponding log files are stored in the project root directory under workspace/.snakemake/log/atop_snapshots, and can be reviewed with the atop -r command. This allows users to analyze potential resource bottlenecks. However, perf might provide more targeted and in-depth insights, especially for diagnosing performance issues related to CPU and I/O behavior. This could be a more effective approach than relying on atop alone.
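For reference, a sketch of both approaches (the snapshot file name is illustrative, and attaching perf to a live hisat-3n process is just one possible target):

# Record an atop snapshot every 60 s into a raw log, then replay it
atop -w workspace/.snakemake/log/atop_snapshots/run.atop 60 &
atop -r workspace/.snakemake/log/atop_snapshots/run.atop

# Sample call stacks of a running hisat-3n process for 60 s, then inspect hotspots
perf record -g -p "$(pgrep -n hisat-3n)" -- sleep 60
perf report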

  1. Resource Allocation Optimization

    In the Snakefile, the resources directive was used to increase resource grouping, allowing for more precise control over resource distribution. A three-level scheduling strategy was implemented:

    # Snakefile example
    rule cutadapt_SE:
        resources:
            stage1=1,  # Each task consumes 1 stage1 unit
            mem_mb=32768
        threads: 36
    
    # Command-line adjustment
    snakemake --cores 108 --resources stage1=3 stage2=9 stage3=14 misc=8  # Resource allocation by execution stage

    Stage Division and Thread Allocation:

    This table illustrates the division of tasks by execution stage, with a specific allocation of threads and cores to each stage, ensuring more efficient resource distribution.

    Table 4.8: Resource Allocation Management Concept

    | Stage | Threads/Task | Rules | Parallel | Total Cores |
    | --- | --- | --- | --- | --- |
    | 1 | 36 | cutadapt_SE | 3 | 108 |
    | 2 | 10 | hisat2_mapping* | 9 | 90 |
    | 3 | 5 | hisat2_calling* | 14 | 70 |
    | 4 | 4 | Stats tasks | 8 | 32 |
  2. Integrate a Database with Bioinformatics Tools like Samtools and HISAT

    Incorporating a database into workflows that utilize bioinformatics tools like Samtools and HISAT offers significant advantages in terms of data management, reproducibility, scalability, and efficiency. A database can optimize large-scale genomic data processing by enabling faster data retrieval, ensuring data integrity, and supporting effective resource management. This integration ultimately enhances the performance and usability of bioinformatics pipelines, allowing for better handling of complex datasets and accelerated research.

    • Structured Storage of Large Datasets: Bioinformatics workflows often generate vast amounts of data, such as aligned reads, variant calls, and methylation sites. Storing these intermediate results and associated metadata in a structured database facilitates efficient management. Relational databases such as PostgreSQL or SQLite can support complex queries and indexing, optimizing data retrieval and manipulation.

    • Faster Data Access: Databases are optimized for quick data access. Storing critical data (e.g., alignment summaries or variant information) in a database enables users to query subsets of data without needing to reprocess large files, reducing overall execution time.

    • Tracking Metadata and Intermediate Results: A database can store metadata related to the analysis pipeline (e.g., parameters used, tool versions, timestamps) and intermediate results. This improves reproducibility by allowing researchers to trace every step of the analysis, making it easier to rerun or validate the analysis with the same parameters.

    • Auditability: Databases make it easier to track changes, which is crucial for auditing the workflow. This feature is particularly valuable in regulatory or clinical environments where reproducibility and validation of results are essential.

    • Cross-Platform Compatibility: A well-structured database can facilitate data sharing and integration across various platforms and teams, supporting collaboration and the seamless exchange of results.

    Our optimization plan involves integrating databases into bioinformatics workflows to streamline the processing and management of large datasets, ultimately enhancing the performance and scalability of the analysis pipelines.
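    As a hypothetical sketch of this plan (the schema and database file are our invention, not an existing component of the pipeline):

    # Track per-rule runtimes and tool versions in SQLite
    sqlite3 pipeline.db "CREATE TABLE IF NOT EXISTS runs (rule TEXT, sample TEXT, tool_ver TEXT, started TEXT, minutes REAL);"
    sqlite3 pipeline.db "INSERT INTO runs VALUES ('hisat2_3n_mapping_genome_SE', 'SRR23538290', 'hisat-3n (manual build)', '2025-02-06T09:00', 29.02);"

    # Aggregate runtime per rule without reparsing log files
    sqlite3 pipeline.db "SELECT rule, SUM(minutes) FROM runs GROUP BY rule;"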

Discussion of Unsuccessful Optimization Attempts

  1. hisat-3n_lessConversion

    While considering swapping tools (such as replacing bgzip with pigz) to improve performance, we found that the output files were not compatible with subsequent stream processing. During this process, we came across the hisat-3n_lessConversion branch in the HISAT2 GitHub repository. This branch seemed to hold potential for optimization, so we compiled and tested it in branch v0.52, hoping to achieve better performance.

    While testing the hisat-3n_lessConversion tool in v0.52, we found that the overall runtime was quite similar to that of the original HISAT-3N version. Specifically, the total CPU time was 13.68 hours, and the wall time was 2 hours, 21 minutes, and 54 seconds; both metrics were essentially unchanged compared to the previous version. The accuracy and correlation of the results, compared to the baseline test, also remained at a consistently high level.

    Unfortunately, despite these efforts, there was no reduction in processing time. This outcome led us to reflect on the changes made in the branch. We considered that the adjustments in the hisat-3n_lessConversion branch might not address the underlying bottlenecks in our workflow, or that the specific optimizations introduced were less impactful than anticipated. Ultimately, we concluded that further investigation or a different approach would be necessary to achieve the desired performance improvements.

  2. Optimizing hisat2_3n_calling_unfiltered_unique Performance with tmpdir="/dev/shm"

    rule hisat2_3n_calling_unfiltered_unique:
        input:
            INTERNALDIR / "aligned_bam/{sample}.{ref}.bam",
        output:
            temp(TEMPDIR / "unfiltered_unique/{sample}.{ref}.tsv.gz"),
        params:
            fa=lambda wildcards: (
                REF[wildcards.ref]["fa"]
                if wildcards.ref != "genes" or not CUSTOMIZED_GENES
                else "prepared_genes/genes.fa"
            ),
            samtools_threads=1,     
            hisat_threads=2,       
            bgzip_threads=2,        
            tmpdir="/dev/shm", # add
        threads: 5
        shell: # add export TMPDIR
            """
            export TMPDIR={params.tmpdir} && {BIN[samtools]} view -@ {params.samtools_threads} -e "rlen<100000" -h {input} | stdbuf -o 1M {BIN[hisat3ntable]} -p {params.hisat_threads} -u --alignments - --ref {params.fa} --output-name /dev/stdout --base-change C,T | cut -f 1,2,3,5,7 | {BIN[bgzip]} -@ {params.bgzip_threads} -c > {output}
            """

    We hypothesized that setting tmpdir="/dev/shm" in the HISAT pipeline could yield significant performance benefits, particularly when processing large datasets.

    Since /dev/shm is a RAM-based filesystem (tmpfs), it operates entirely in memory rather than relying on disk storage, enabling significantly faster read and write operations. Moreover, by redirecting temporary file operations to /dev/shm, the system can maintain higher responsiveness, mitigating excessive disk I/O contention that might otherwise degrade overall performance.

    In branch v0.6, we conducted a test to evaluate the impact of setting tmpdir="/dev/shm" in the HISAT pipeline, particularly focusing on performance improvements when processing large datasets. The results, summarized in Table 4.9, compare execution time before and after the change.

    Table 4.9: Performance Comparison Before and After Using /dev/shm as tmpdir

    | Configuration (rule: hisat2_3n_calling_unfiltered_unique) | Job count | Time per job (min) | Total time (min) | Threads |
    | --- | --- | --- | --- | --- |
    | Before (tmpdir on disk) | 6 | 25.61 | 153.68 | 5 |
    | After (tmpdir="/dev/shm") | 6 | 25.48 | 152.87 | 5 |

    The improvement amounts to approximately 0.13 minutes per job (0.5% reduction). The slight decrease in execution time suggests that relocating temporary file operations to /dev/shm may have helped reduce I/O overhead. However, the modest improvement indicates that disk I/O was not a significant bottleneck in this case.

    This could be considered an unsuccessful optimization attempt, as the expected performance gains were not realized.

Conclusion

In this study, we achieved an overall runtime reduction of 28.55%, significantly enhancing efficiency. Specifically, we achieved an 11.78% reduction in the execution time of the hisat2_3n_calling rules and an 18.75% improvement in the hisat2_3n_mapping rules through targeted refinements to shell parameters and resource allocation. However, these rules remain major contributors to the overall workflow runtime, accounting for a significant portion of the computational overhead (Table 4.10).

Table 4.10: Total Process Time Analysis

| rule_name | job_count | threads | Time (%) |
| --- | --- | --- | --- |
| combine_runs | 9 | 8 | 0.32% |
| cutadapt_SE | 3 | 36 | 2.21% |
| dedup_mapping | 6 | 3 | 9.27% |
| extract_unmap_bam_internal_SE | 9 | 4 | 0.33% |
| group_pileup | 2 | 6 | 1.72% |
| hisat2_3n_calling_filtered_multi | 6 | 5 | 17.08% |
| hisat2_3n_calling_filtered_unqiue | 6 | 5 | 17.31% |
| hisat2_3n_calling_unfiltered_multi | 6 | 5 | 17.18% |
| hisat2_3n_calling_unfiltered_unique | 6 | 5 | 16.86% |
| hisat2_3n_filtering | 6 | 4 | 0.90% |
| hisat2_3n_mapping_contamination_SE | 3 | 10 | 4.19% |
| hisat2_3n_mapping_genes_SE | 3 | 10 | 3.93% |
| hisat2_3n_mapping_genome_SE | 3 | 28 | 6.44% |
| hisat2_3n_sort | 9 | 4 | 1.06% |
| join_pileup | 6 | 6 | 0.92% |
| prepare_genes_index | 1 | 12 | 0.00% |
| stat_mapping_number | 3 | 4 | 0.12% |
| stat_sample_background | 6 | 2 | 0.16% |

Therefore, in the future, we plan to further investigate the hisat2_3n_calling, hisat2_3n_mapping, and dedup_mapping rules, conducting a comprehensive scientific analysis to identify additional optimization opportunities. We hope that, with further refinements in resource allocation and parameter adjustments, the overall workflow runtime can be further reduced.

References:

[1] Song, H., Zhang, J., Liu, B., Xu, J., Cai, B., Yang, H., Straube, J., Yu, X., & Ma, T. (2022). Biological roles of RNA m5C modification and its implications in cancer immunotherapy. Biomarker Research, 10, Article 15. https://biomarkerres.biomedcentral.com/articles/10.1186/s40364-022-00362-8

[2] Tuorto, F., Liebers, R., Musch, T., Schaefer, M., Hofmann, S., Kellner, S., Frye, M., Helm, M., Stoecklin, G., & Lyko, F. (2012). RNA cytosine methylation by Dnmt2 and NSun2 promotes tRNA stability and protein synthesis. Nature Structural & Molecular Biology, 19, 900–905. https://www.nature.com/articles/nsmb.2357