Top 7 PyElph Features Every Genetic Genealogist Should Know

PyElph is a lightweight, command-line-focused toolkit designed to automate common genetic genealogy file-handling tasks. It is particularly popular with genealogists who work directly with raw DNA data files (GEDmatch, FamilyTreeDNA, 23andMe, and Ancestry export formats, among others) and who want reproducible, scriptable workflows. Below are the seven features that deliver the most value, with practical examples, tips, and warnings so you can put each one to work right away.
1) File format detection and normalization
PyElph recognizes and normalizes a wide range of raw DNA file formats. Instead of manually opening and inspecting vendor-specific headers or column orders, PyElph can parse and reformat files into a predictable, consistent structure.
Why it matters
- Saves time when ingesting data from multiple sources.
- Reduces errors due to mismatched column names or ordering.
Practical example
- Convert an Ancestry VCF/CSV export and a 23andMe raw file into a single normalized CSV with consistent columns for sample ID, chromosome, position, rsID, and genotype.
Tips
- Always run detection on a copy of original files.
- Check the log output after normalization to confirm the sample IDs and SNP counts match expectations.
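PyElph's exact commands aren't reproduced here, but the normalization logic itself is straightforward to sketch in plain Python. The example below parses a 23andMe-style raw file (tab-separated `rsid`, `chromosome`, `position`, `genotype` columns with `#` header comments) into the normalized schema described above; the function name and sample ID are illustrative.

```python
import csv
import sys

# Target schema for all vendors, regardless of their native column order.
NORMALIZED_FIELDS = ["sample_id", "chromosome", "position", "rsid", "genotype"]

def normalize_23andme(raw_text, sample_id):
    """Yield normalized dict rows from 23andMe-style raw text.

    Other vendors would get their own column mapping onto the
    same NORMALIZED_FIELDS schema.
    """
    rows = []
    for line in raw_text.splitlines():
        if not line or line.startswith("#"):  # skip header comment lines
            continue
        rsid, chrom, pos, genotype = line.split("\t")
        rows.append({
            "sample_id": sample_id,
            "chromosome": chrom,
            "position": int(pos),   # store positions as integers for sorting
            "rsid": rsid,
            "genotype": genotype,
        })
    return rows

def write_normalized_csv(rows, fh):
    """Write normalized rows as CSV with a fixed, predictable header."""
    writer = csv.DictWriter(fh, fieldnames=NORMALIZED_FIELDS)
    writer.writeheader()
    writer.writerows(rows)

sample = "# rsid\tchromosome\tposition\tgenotype\nrs4477212\t1\t82154\tAA\n"
write_normalized_csv(normalize_23andme(sample, "kit001"), sys.stdout)
```

Because every vendor's file is mapped onto the same field list, downstream steps (QC, merging, exports) never need to know which company a file came from.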
2) SNP filtering and QC (quality control)
PyElph provides flexible SNP filtering options—by chromosome, genomic range, call rate, or custom SNP lists—and basic QC checks like missingness rates and heterozygosity for autosomal SNPs.
Why it matters
- Helps remove noisy or low-quality markers that can skew match computations.
- Lets you apply the same QC thresholds consistently across batches.
Practical example
- Exclude all SNPs with >5% missing genotype calls, remove mitochondrial and Y-chromosome-only SNPs when working on autosomal analyses, and filter to a curated SNP panel used for matching.
Tips
- Use conservative thresholds for initial runs; tighten them if you observe excessive false positives.
- Keep a record of SNP counts before and after filtering for reproducibility.
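A minimal missingness filter along these lines can be written in a few lines of Python; the in-memory data layout and the no-call symbols (`--`, `00`) are assumptions reflecting common vendor raw-file conventions, not PyElph's internal representation.

```python
def filter_by_missingness(genotypes, n_samples, max_missing=0.05):
    """Keep SNPs whose missing-call rate is at or below max_missing.

    genotypes: {rsid: {sample_id: genotype}} (hypothetical layout);
    '--', '00', and '' are treated as no-calls, as in many raw files.
    """
    kept = {}
    for rsid, calls in genotypes.items():
        n_missing = sum(1 for g in calls.values() if g in ("--", "00", ""))
        n_missing += n_samples - len(calls)  # samples with no entry at all
        if n_missing / n_samples <= max_missing:
            kept[rsid] = calls
    return kept

demo = {
    "rs1": {"a": "AA", "b": "AG"},   # 0% missing: kept
    "rs2": {"a": "--", "b": "GG"},   # 50% missing: dropped at 5% threshold
}
print(sorted(filter_by_missingness(demo, n_samples=2)))  # -> ['rs1']
```

Logging `len(genotypes)` before and `len(kept)` after each run gives you exactly the before/after SNP counts recommended above.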
3) Strand alignment and allele flipping
Different vendors and reference builds sometimes report genotypes on opposite DNA strands. PyElph can detect strand mismatches against a chosen reference and flip alleles as needed so all datasets use the same strand orientation.
Why it matters
- Prevents false mismatches caused solely by strand differences.
- Essential before merging datasets or computing IBS/IBD.
Practical example
- Align multiple datasets to the GRCh37 reference alleles; flip A/T or C/G SNPs only when strand-aware checks indicate a mismatch.
Tips
- Always supply the reference build used in your downstream analyses (e.g., GRCh37/hg19).
- Pay special attention to palindromic SNPs (A/T and C/G); some pipelines remove them rather than risk ambiguity.
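The strand logic itself is simple base complementation; a sketch of the checks described above, with illustrative function names (not PyElph's API), looks like this:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindromic(a1, a2):
    """A/T and C/G SNPs read the same on both strands, so a flip is undetectable."""
    return COMPLEMENT.get(a1) == a2

def flip_genotype(genotype):
    """Complement both alleles, e.g. 'TC' on the minus strand becomes 'AG'."""
    return "".join(COMPLEMENT[b] for b in genotype)

def align_to_reference(genotype, ref_alleles):
    """Flip a genotype when neither allele matches the reference pair.

    ref_alleles: the (ref, alt) pair from the chosen build (e.g. GRCh37).
    Returns None for palindromic SNPs, which many pipelines drop.
    """
    if is_palindromic(*ref_alleles):
        return None
    if any(b in ref_alleles for b in genotype):
        return genotype            # already on the reference strand
    return flip_genotype(genotype)

print(align_to_reference("AG", ("A", "G")))  # -> 'AG' (already aligned)
print(align_to_reference("TC", ("A", "G")))  # -> 'AG' (flipped)
print(align_to_reference("AT", ("A", "T")))  # -> None (palindromic, ambiguous)
```

The `None` return for palindromic SNPs makes the ambiguity explicit: the caller must decide whether to drop them or resolve them by allele frequency, rather than silently guessing.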
4) Sample merging and duplicate resolution
PyElph simplifies merging multiple samples into a single dataset and can detect duplicates or sample swaps through genotype similarity metrics.
Why it matters
- Makes combining results from different testing companies straightforward.
- Helps catch labeling errors or duplicated uploads.
Practical example
- Merge a set of 23andMe and Ancestry files into one matrix, identify pairs of samples with >99.9% concordance as probable duplicates, and flag discordant IDs for manual review.
Tips
- When duplicates are found, compare metadata (name, email, upload date) to decide which sample to keep.
- Generate a concordance report for audit trails.
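The concordance metric behind duplicate detection is easy to sketch: count the fraction of shared SNPs with identical calls. This is a generic implementation, not PyElph's code; genotype order within a call is normalized so 'AG' and 'GA' compare equal.

```python
DUPLICATE_THRESHOLD = 0.999  # >99.9% concordance flags a probable duplicate

def concordance(g1, g2):
    """Fraction of overlapping SNPs with identical (order-insensitive) calls.

    g1, g2: {rsid: genotype} dicts for two samples.
    """
    shared = set(g1) & set(g2)
    if not shared:
        return 0.0
    same = sum(1 for r in shared if sorted(g1[r]) == sorted(g2[r]))
    return same / len(shared)

a = {"rs1": "AG", "rs2": "CC", "rs3": "TT"}
b = {"rs1": "GA", "rs2": "CC", "rs3": "TC"}
score = concordance(a, b)           # 2 of 3 shared calls agree
print(round(score, 3), score > DUPLICATE_THRESHOLD)  # -> 0.667 False
```

Running this over all sample pairs and writing the scores to CSV gives you the audit-trail concordance report mentioned in the tips.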
5) Subsetting and chromosome-level exports
Need only specific regions or chromosomes? PyElph can export per-chromosome files or slice out genomic ranges (for example, a candidate IBD segment) into new, smaller files for targeted analysis.
Why it matters
- Focuses compute and memory resources on regions of interest.
- Simplifies sharing small, relevant subsets with collaborators.
Practical example
- Extract chromosome 5, positions 50,000,000–60,000,000 for fine-scale phasing or IBD validation and produce a VCF or CSV limited to that interval.
Tips
- Keep coordinate systems consistent between tools (1-based vs 0-based positions); an off-by-one interval can silently drop the boundary markers of a segment.
- When extracting segments, also export marker positions to maintain traceability.
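Region slicing reduces to a chromosome match plus an interval test. The sketch below assumes the normalized row layout used earlier in this article (dicts with `chromosome` and `position` keys) and treats the interval as 1-based and inclusive, matching the example coordinates above.

```python
def slice_region(rows, chrom, start, end):
    """Return rows inside a 1-based, inclusive genomic interval.

    rows: iterable of dicts with 'chromosome' and 'position' keys.
    Keeping rsID and position in the output preserves marker traceability.
    """
    return [r for r in rows
            if r["chromosome"] == chrom and start <= r["position"] <= end]

markers = [
    {"chromosome": "5", "position": 49_999_999, "rsid": "rs_a"},  # just outside
    {"chromosome": "5", "position": 55_000_000, "rsid": "rs_b"},  # inside
    {"chromosome": "6", "position": 55_000_000, "rsid": "rs_c"},  # wrong chrom
]
segment = slice_region(markers, "5", 50_000_000, 60_000_000)
print([r["rsid"] for r in segment])  # -> ['rs_b']
```

Writing `segment` back out through the same CSV writer used for normalization yields a small, shareable file limited to the candidate IBD interval.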
6) Command-line scripting and reproducible pipelines
PyElph is designed for scripts and pipelines. Each operation has flags and parameters suitable for non-interactive, repeatable runs so you can include them in Bash, Python, or workflow managers (Nextflow, Snakemake).
Why it matters
- Enables reproducible research and automation of repetitive tasks.
- Facilitates batch processing for large projects or public outreach initiatives.
Practical example
- Create a Snakemake rule that normalizes incoming uploads, runs QC filters, aligns strands, and produces a merged dataset, all triggered when new raw files appear in a directory.
Tips
- Version-control your pipeline scripts and include the PyElph version used.
- Log parameter sets and outputs for each run to support reproducibility.
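If you drive the steps from Python rather than a workflow manager, the reproducibility bookkeeping can be as simple as the runner below. The step functions and parameter names are toy stand-ins; the point is that every run records its parameters, step order, and timings alongside the result.

```python
import json
import time

def run_pipeline(steps, params, log_path=None):
    """Run named pipeline steps in order, recording parameters and timings.

    steps: list of (name, callable) pairs; each callable takes and
    returns the working dataset. Names and params are illustrative.
    """
    record = {"params": params, "steps": []}
    state = params.get("input")
    for name, fn in steps:
        t0 = time.time()
        state = fn(state)
        record["steps"].append({"step": name,
                                "seconds": round(time.time() - t0, 3)})
    record["result"] = state
    if log_path:  # persist the run record for the audit trail
        with open(log_path, "w") as fh:
            json.dump(record, fh, indent=2)
    return record

# Toy stand-ins for normalize -> QC filter:
steps = [
    ("normalize", lambda calls: [g.upper() for g in calls]),
    ("qc_filter", lambda calls: [g for g in calls if g != "--"]),
]
report = run_pipeline(steps, {"input": ["ag", "--", "cc"]})
print(report["result"])  # -> ['AG', 'CC']
```

In Snakemake or Nextflow the same idea applies: each rule's parameters live in version-controlled files, so the run record falls out of the workflow definition itself.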
7) Lightweight reporting and logging
PyElph emits concise logs and summary reports (sample counts, SNP counts, missingness, flagged issues). These outputs are designed to be human-readable and script-friendly (CSV/JSON) for downstream dashboards or audits.
Why it matters
- Makes it easy to spot anomalies quickly and to feed results into other tools.
- Supports documentation and transparency when sharing data with collaborators.
Practical example
- After a batch run, produce a JSON summary containing numbers of samples processed, SNPs filtered, duplicates found, and any strand flips performed; feed that JSON into a web dashboard.
Tips
- Pipe logs to a centralized log-collector if running large pipelines.
- Include checksums of input files in reports to verify provenance.
Putting the features together: a short workflow example
- Normalize all incoming raw files to the same CSV format.
- Run SNP QC filtering (missingness threshold 5%).
- Align strands to GRCh37 and remove ambiguous palindromic SNPs.
- Merge samples and detect duplicates; resolve conflicts.
- Export per-chromosome matrices for IBD detection tools.
- Save a JSON run-report with counts, QC metrics, and file checksums.
Final note on best practice
- Always retain original raw files and document each processing step. Genetic genealogy conclusions rely on reproducible handling of sensitive data; clear logs and conservative QC help maintain trust in your results.