Development of sample mismatch detection algorithm in genome sequencing cohorts

YUHSpace


Other Titles: 유전체 데이터 코호트 내 시료 불일치 검출 방법 개발 (Development of a method for detecting sample mismatches within genomic data cohorts)

Authors: 전혜인

College: College of Medicine (의과대학)

Degree: Master's

Issue Date: 2019

Abstract: Over the past decade, rapid advances in next-generation sequencing (NGS) technology have expanded clinical research using NGS data and enabled large genomic studies at low cost. However, broader use of NGS requires a large number of samples to be processed in a limited time, so human error in sample handling remains a constant concern. Sample mismatch during NGS is one of the frequent problems that can cause an entire genomic analysis to fail. A regular cohort-level sample match check is therefore needed to ensure that no mismatch has occurred. Currently available tools, however, require additional processing beyond the main analysis, take a long time on large datasets, or perform worse on targeted sequencing data than on data with larger target regions. We developed a new, automated tool (BAMixChecker) that accurately detects sample mismatches in a given cohort of BAM files with minimal user intervention. For whole-genome sequencing (WGS), whole-exome sequencing (WES), and RNA-seq datasets, BAMixChecker compares samples using only 853 well-mappable, frequently variant single-nucleotide polymorphism (SNP) loci. For targeted sequencing data, it uses a flexible, data-specific set of SNPs selected with the target region information in a BED file. BAMixChecker detects orphan (unpaired) and swapped (mispaired) samples based on a genotype-concordance score and entropy-based file name analysis. It shows ~100% accuracy in real WES, RNA-seq, and targeted sequencing cohorts, even for small panels (<50 genes). BAMixChecker also provides an HTML report with which users can quickly inspect any mismatch events.
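The genotype-concordance score mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the definition below (fraction of shared SNP loci with identical genotype calls) is an assumption for illustration, and the names `concordance`, `Locus`, `min_shared`, and the sample data are hypothetical.

```python
from typing import Dict, Optional, Tuple

Locus = Tuple[str, int]   # (chromosome, position); hypothetical representation
Genotype = str            # e.g. "0/0", "0/1", "1/1"


def concordance(
    a: Dict[Locus, Genotype],
    b: Dict[Locus, Genotype],
    min_shared: int = 1,
) -> Optional[float]:
    """Fraction of loci genotyped in both samples whose calls are identical.

    Assumed definition for illustration only; not taken from the thesis.
    Returns None when fewer than `min_shared` loci are shared, because a
    score computed from too few loci is not informative.
    """
    shared = [locus for locus in a if locus in b]
    if len(shared) < min_shared:
        return None
    matches = sum(1 for locus in shared if a[locus] == b[locus])
    return matches / len(shared)


# Made-up genotype calls for two samples at a handful of SNP loci.
sample_x = {("chr1", 100): "0/1", ("chr1", 200): "1/1",
            ("chr2", 50): "0/0", ("chr3", 10): "0/1"}
sample_y = {("chr1", 100): "0/1", ("chr1", 200): "1/1",
            ("chr2", 50): "0/1"}

score = concordance(sample_x, sample_y)
```

Here three loci are shared and two of them carry identical calls. Restricting the comparison to loci covered in both samples is one way to handle targeted panels, where coverage differs between files.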

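As next-generation sequencing (NGS) technology has advanced, studies using large numbers of NGS datasets have expanded across many areas of research. Mismatch among samples derived from the same individual, arising while NGS data are produced, is one of the problems that can affect the results of an entire genomic analysis. Because samples pass through many experimenters and researchers from sample collection through the experimental procedures and the bioinformatics analysis, samples from different individuals may be mixed up or mislabeled. It is therefore necessary to check sample matching across the whole cohort. However, currently available programs are difficult to use in practice: they require additional steps on top of the usual NGS analysis workflow, such as extra preprocessing or postprocessing for interpretation; they take a long time to analyze many datasets; or they show low accuracy on data with small target regions, such as targeted sequencing data. In this study, we developed BAMixChecker, an automated program that detects sample mismatches accurately and quickly from a cohort of BAM files, which users already have from a basic NGS analysis workflow. BAMixChecker compares whole genome sequencing (WGS), whole exome sequencing (WES), and RNA-seq cohort samples using only 853 single-nucleotide polymorphism (SNP) loci that map well and vary frequently. For targeted sequencing, it additionally takes a BED file as input and builds a cohort-specific SNP set suited to the target regions. Using the somatic variant information at these comparison loci, together with sample-pair information derived either from a file name similarity analysis algorithm or from user input, it determines whether samples are mismatched. Mismatched samples are reported in two classes: a sample whose genetic information matches no other sample is reported as "Orphan", and a sample whose genetic information disagrees with the pair information from file names or user input is reported as "Swapped". Results are provided as an HTML file that gives an at-a-glance overview and as a TXT file that can be viewed directly in a Linux environment. The algorithm showed ~100% accuracy in real WES, RNA-seq, and targeted sequencing cohorts, including panels of 50 or fewer genes.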
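The "Orphan" and "Swapped" classes described above can be sketched as a small classification step over pairwise genetic match scores and expected sample pairs. This is an illustration under stated assumptions, not the thesis implementation: the function name `classify_samples`, the score threshold of 0.7, the sample names, and the scores are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = FrozenSet[str]  # an unordered pair of sample names


def classify_samples(
    samples: List[str],
    scores: Dict[Pair, float],
    expected_pairs: Set[Pair],
    threshold: float = 0.7,  # assumed cutoff, not a value from the thesis
) -> Tuple[List[str], List[str]]:
    """Return (orphans, swapped) following the two classes in the abstract.

    Orphan:  the sample matches no other sample genetically.
    Swapped: the sample has genetic matches, but they differ from the
             partners expected from file names or user input.
    Every name in `scores` and `expected_pairs` must appear in `samples`.
    """
    genetic: Dict[str, Set[str]] = {s: set() for s in samples}
    expected: Dict[str, Set[str]] = {s: set() for s in samples}
    for pair, score in scores.items():
        if score >= threshold:
            a, b = tuple(pair)
            genetic[a].add(b)
            genetic[b].add(a)
    for pair in expected_pairs:
        a, b = tuple(pair)
        expected[a].add(b)
        expected[b].add(a)
    orphans = sorted(s for s in samples if not genetic[s])
    swapped = sorted(s for s in samples
                     if genetic[s] and genetic[s] != expected[s])
    return orphans, swapped


# Made-up cohort: the P1 and P2 tumor labels are exchanged, and the two
# P3 files do not match each other or anything else.
samples = ["P1_N", "P1_T", "P2_N", "P2_T", "P3_N", "P3_T"]
expected_pairs = {frozenset(p) for p in
                  [("P1_N", "P1_T"), ("P2_N", "P2_T"), ("P3_N", "P3_T")]}
scores = {
    frozenset(("P1_N", "P2_T")): 0.95,
    frozenset(("P2_N", "P1_T")): 0.93,
    frozenset(("P1_N", "P1_T")): 0.35,
    frozenset(("P2_N", "P2_T")): 0.33,
    frozenset(("P3_N", "P3_T")): 0.30,
}

orphans, swapped = classify_samples(samples, scores, expected_pairs)
```

In this made-up cohort the P1 and P2 samples come out as swapped and both P3 samples as orphans. Pairs missing from `scores` are treated as non-matching, which keeps the sketch short but would need revisiting for real data.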

Files in This Item: TA02238.pdf

Appears in Collections: 1. College of Medicine (의과대학) > Others (기타) > 2. Thesis

URI: https://ir.ymlib.yonsei.ac.kr/handle/22282913/178179
