Introduction
When you submit a genome assembly, or NCBI assembles the reads you submitted, it ends up in GenBank. If the assembly is of sufficient quality, it is annotated with PGAP (the NCBI Prokaryotic Genome Annotation Pipeline) and added to RefSeq. This means the same assembly can exist in GenBank (with your original annotation) and in RefSeq (with the PGAP annotation) under different accession numbers: GCA_ for GenBank and GCF_ for RefSeq.
There are many reasons why a GenBank assembly can be excluded from RefSeq. Here I outline how to find out whether the genome you are working with is potentially problematic, and why.
Commands
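# Download the GenBank assembly summary (a large tab-separated file)
wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
# csvkit provides the csvcut command
pip3 install csvkit
# Extract the excluded_from_refseq column, split multiple reasons on ";",
# trim spaces, drop empty values, and count how often each reason occurs
csvcut -t -K 1 -c 'excluded_from_refseq' assembly_summary_genbank.txt \
  | tail -n +2 | tr ";" "\n" \
  | sed -e 's/^ //' -e 's/ $//' | grep -v '""' \
  | sort | uniq -c | sort -nr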
Results
 198215 derived from surveillance project
  32538 derived from metagenome
  16820 derived from environmental source
  10389 metagenome
   4684 low contig N50
   2187 partial
   1822 many frameshifted proteins
   1568 derived from single cell
    856 genome length too large
    820 genome length too small
    752 high contig L50
    562 low quality sequence
    357 contaminated
    352 missing tRNA genes
    309 abnormal gene to sequence ratio
    274 validation errors
    194 missing ribosomal protein genes
    114 missing rRNA genes
    104 untrustworthy as type
     77 unverified source organism
     38 misassembled
     12 mixed culture
      6 chimeric
      4 low gene count
      2 hybrid
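The tally above covers every assembly in GenBank. To check a single genome you are working with, something like the following sketch may help. It assumes assembly_summary_genbank.txt has already been downloaded, that the column-name line starts with "#", and that the accession is in the first column. The function name exclusion_reasons is made up for this example.

```shell
# Hypothetical helper: print the excluded_from_refseq value for one accession.
# Usage: exclusion_reasons <accession> assembly_summary_genbank.txt
exclusion_reasons() {
  awk -F '\t' -v acc="$1" '
    # Header lines start with "#". Find the column by name rather than
    # relying on a fixed position (assumption: the name is unchanged).
    col == 0 && /^#/ {
      for (i = 1; i <= NF; i++) if ($i == "excluded_from_refseq") col = i
      next
    }
    # Print the raw field for the matching accession and stop reading.
    col > 0 && $1 == acc { print $col; found = 1; exit }
    END {
      if (!found) { print "accession not found" > "/dev/stderr"; exit 1 }
    }
  ' "$2"
}
```

Call it with the GCA_ accession of your assembly and the path to the summary file. Multiple reasons come back as one line separated by semicolons, exactly as stored in the file.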
Conclusion
Use genomes from RefSeq wherever possible. The main exception is public health microbiology: there, the "derived from surveillance project" reason probably means the genome is part of GenomeTrakr, and it may still be important.
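For the public health case, a rough way to pull out those genomes is to list the accessions whose only recorded reason is "derived from surveillance project". This is a sketch: it assumes the field holds exactly that string when no other reason applies, and the function name surveillance_only is made up for this example.

```shell
# Hypothetical helper: list accessions excluded from RefSeq solely because
# they are "derived from surveillance project".
# Usage: surveillance_only assembly_summary_genbank.txt
surveillance_only() {
  awk -F '\t' '
    # Locate the excluded_from_refseq column by name in the "#" header line.
    col == 0 && /^#/ {
      for (i = 1; i <= NF; i++) if ($i == "excluded_from_refseq") col = i
      next
    }
    # Exact match, so rows with additional reasons are skipped.
    col > 0 && $col == "derived from surveillance project" { print $1 }
  ' "$1"
}
```

Rows that combine the surveillance reason with a quality problem, such as a low contig N50, are deliberately left out, since those assemblies may be poor for other reasons.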