Hello!
I am trying to run GenotypeGVCFs on a set of 12 pooled-seq (50 fish per pool, using a ploidy of 20 in GATK) whole genome samples. Unfortunately I have had recurrent java memory issues ( I think!) and this one error that I can't figure out. I posted a version of this question originally on gatk/discussion/13430 and didn't pursue it because I thought it was tied to other problems I was having with GenomicsDBImports. Since then I have tried a couple things, but still get the error.
Here is a little bit about my analysis so far: I have pre-processed bam files (according to GATK best practices) for each of my 12 pools. I'm working on a non-model organism, so I don't have a full genome to align my reads to, just 24 linkage groups and 8626 scaffolds. The number of scaffolds has presented some issues, so at times along my pipeline I have split the analysis into two tracks, keeping the linkage groups together and the scaffolds together. I start with HaplotypeCaller scatter/gathering over the linkage groups then using -L and treating the scaffolds as intervals. Then MergeGVCFs to combine all the GVCFs into one GVCF file per pool. I use LeftAlignAndTrimVariants to left align and trim the variants (I was having a problem with GenomicsDB where there were variants with a ton of PLs and at it was preventing me from importing to the GenomicsDB, this seems to have solved that problem). Then I use GenomicsDBImport to make a GenomicsDB for each of my linkage groups, with all my pools included. I'm using CombineGVCFs to combine all the GVCFs for the scaffolds.
This is where I run into trouble. I have made a GenomicsDB for each of my 24 linkage groups without issue, but to run GenotypeGVCFs on the GenomicsDB and get genotypes I have to split the linkage group into intervals. I can get some intervals of just under ~2million base pairs to genotype with this command and runtime:
command <<<
set -e
tar -xf ${workspace_tar}
WORKSPACE=$( basename ${workspace_tar} .tar)
"/gatk/gatk" --java-options "-Xmx125g -Xms125g" \
GenotypeGVCFs \
-R ${ref_fasta} \
-O ${output_vcf_filename} \
--disable-read-filter NotDuplicateReadFilter \
-V gendb://$WORKSPACE \
--verbosity ERROR \
-L ${interval}
>>>
runtime {
docker: "broadinstitute/gatk:4.1.0.0"
memory: "150 GB"
cpu: "4"
disks: "local-disk " + disk_size + " HDD"
preemptible: preemptible
}
But some intervals fail with the following error, which I have been treating like an out of memory error, and continuously bumping up the -Xmx, -Xms, total memory and the disk space (up to 500G), though I'm not sure which would help- I just feel desperate to get them genotyped.
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fb8e3875edf, pid=19, tid=0x00007fdefaaed700
#
# JRE version: OpenJDK Runtime Environment (8.0_191-b12) (build 1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.191-b12 mixed mode linux-amd64 )
# Problematic frame:
# C [libtiledbgenomicsdb571603236969900509.so+0x17dedf] void VariantOperations::reorder_field_based_on_genotype_index<int>(std::vector<int, std::allocator<int> > const&, unsigned long, MergedAllelesIdxLUT<true, true> const&, unsigned int, bool, bool, unsigned int, RemappedDataWrapperBase&, std::vector<unsigned long, std::allocator<unsigned long> >&, int, std::vector<int, std::allocator<int> > const&, unsigned long, std::vector<int, std::allocator<int> >&)+0x6f
#
# Core dump written. Default location: /cromwell_root/core or core.19
#
# An error report file with more information is saved as:
# /cromwell_root/hs_err_pid19.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Following the examples in this thread: gatk/discussion/comment/43887, I have also tried adding -XX:+UseSerialGC and using --use-jdk-deflater true and --use-jdk-inflater true, and using the most recent GATK: 4.1.1.0. Those have not helped.
Do you know why I might be getting this error? Should I continue to increase the memory/disk space, or reduce the size of the interval, or is there something else that could be wrong?
The other error that I get is maybe more cryptic:
java: /home/vagrant/GenomicsDB/dependencies/htslib/vcf.c:3641: bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed.
Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx125g -Xms125g -jar /gatk/gatk-package-4.1.0.0-local.jar GenotypeGVCFs -R /cromwell_root/uw-hauser-pcod-gatk-tests/gadMor2.fasta -O LG01.3.Genotyped.vcf.gz --disable-read-filter NotDuplicateReadFilter -V gendb://genomicsdb_LGs --verbosity ERROR -L LG01:6000001-8000000
That one seems to be tied to a specific location in the genome. Though this run with the interval LG01:6000001-8000000 failed, it produced part of a VCF file. I made two smaller intervals from the original interval to see if avoiding the area around where the VCF file stopped would help it finish running. My two new intervals were: LG01:6000001-7134350 and LG01:7150000-8000000. The first interval worked fine, but the second gave me the same error and no VCF file.
Do you have any ideas what might be happening here, and what bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed means? I don't need to genotype every location- if some are giving me issues because they are too messy I am ok dropping them. I would just like to make sure that I am not inadvertently introducing errors in my data, or ignoring some warning flags for issues I need to address in the analysis. And, with so many linkage groups split up into intervals, it is tough to have a pipeline that requires manual tinkering when intervals fail.
Thank you so much for any advice you have!