| Title: | Relationship-Informed Pedigree and Variant Scoring |
|---|---|
| Description: | Comparative evaluation of families and candidate variants in rare-variant association studies. The package can be used for two methodologically overlapping but distinct purposes. First, the prior to any genetic or genomic evaluation, evaluation of relative detection power of pedigrees, can direct recruitment efforts by showing which individuals not yet sampled would be the most meaningful additions to a study. Second, after sequencing and analysis, variants based on association with disease status and familial relationships of individuals, aids in variant prioritization. Methodology is described in Nugent (2025) <doi:10.1101/2025.10.06.25337426>. |
| Authors: | Cameron M. Nugent [aut, cre] (ORCID: <https://orcid.org/0000-0002-1135-2605>) |
| Maintainer: | Cameron M. Nugent <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.2 |
| Built: | 2026-05-19 07:42:05 UTC |
| Source: | https://github.com/sequencebio/kinformr |
Sum all the given scores and return a single vector with cumulative "score", "for" and "against" vals. For use in instances where one wishes to combine scores from multiple families.
add.fam.scores(score.vec)add.fam.scores(score.vec)
score.vec |
A vector will all of the per family score outputs. |
A vector with the summed scores of all inputs.
score.fam1 <- c("score" = 1.0, "score.for" = 2.0, "score.against" = 1.0) score.fam2 <- c("score" = 1.0, "score.for" = 3.0, "score.against" = 2.0) out <- add.fam.scores(c(score.fam1, score.fam2))score.fam1 <- c("score" = 1.0, "score.for" = 2.0, "score.against" = 1.0) score.fam2 <- c("score" = 1.0, "score.for" = 3.0, "score.against" = 2.0) out <- add.fam.scores(c(score.fam1, score.fam2))
For affecteds: Take genetic variant and determine the category of the combo.
assign.a(variant)assign.a(variant)
variant |
Variant for individual. genotypes, phased genotypes, or binary encodings accepted. |
Take a disease status and a genetic variant and determine which category the combo falls in. A.c = Affected individual with ALT variant A.i = Affected individual without ALT variant U.c = Unaffected individual without ALT variant U.i = Unaffected individual with ALT variant If theoretical.max = TRUE the true variant statuses are ignored and all affected/unaffected are assigned A.c and U.c respectively. These encoding can then be used show what a family's max score would be.
assign.status(status, variant, theoretical.max = FALSE)assign.status(status, variant, theoretical.max = FALSE)
status |
Disease status of an individual. A = affected, U = unaffected. |
variant |
Variant for individual. genotypes, phased genotypes, or binary encodings accepted. |
theoretical.max |
Should the theoretical maxima be returned instead of the observed values? When true, the scoring assumes correct variant-status pair for each individual. Default is FALSE. |
a string
assign.status("A", "0/1") == "A.c" assign.status("A", "0|0") == "A.i" assign.status("U", 1) == "U.i" assign.status("U", "0|0") == "U.c"assign.status("A", "0/1") == "A.c" assign.status("A", "0|0") == "A.i" assign.status("U", 1) == "U.i" assign.status("U", "0|0") == "U.c"
For unaffecteds: Take a genetic variant and determine the category of the combo.
assign.u(variant)assign.u(variant)
variant |
Variant for individual. genotypes, phased genotypes, or binary encodings accepted. |
Build dictionary with the relationships falling in the different categories for the query row.
build.relation.dict(mat.row, name.stat.dict, drop.unrelated = TRUE)build.relation.dict(mat.row, name.stat.dict, drop.unrelated = TRUE)
mat.row |
A row from a relationship matrix |
name.stat.dict |
A list with the labelled status/variant combo for each individual. |
drop.unrelated |
Should unrelated (-1) relationships be dropped? Default = TRUE. |
A list with the categorized relationship/variant information.
These scores can be used to compare variants of interest within a family.
calc.rv.score( fam.list, affected.weight = 1, unaffected.weight = 0.5, unaffected.max = 8, max.err = 4 )calc.rv.score( fam.list, affected.weight = 1, unaffected.weight = 0.5, unaffected.max = 8, max.err = 4 )
fam.list |
|
affected.weight |
A coefficient to multiply the calculated A.c and A.i relatedness values by. |
unaffected.weight |
A coefficient to multiply the U.c and U.i relatedness values by. |
unaffected.max |
This param controls the score given to a first degree unaffected relatives scores decay from this specified maximum by half for each subsequent relationship degree. |
max.err |
A heuristic cap of the number of incorrect assignments allowed when scoring. When the total number of incorrect (sum of affected and unaffected) is exceeded, the variant's score is set to 0, regardless of the number of points for or against. This simplifies scoring and allows for fast filtering of poor quality variants. Default is 4. |
For each encoded relationship, a relationship-informed weight is applied to their sharing or not sharing of a variant. The score for affected status is: (1 / coefficient.of.relatedness) * status.weight For example, an affected cousin (encoded as a 3) would get a score of: (1/0.125) * affected weight 8 * 1 = 8 points in favour of the variant. Whereas for unaffected individuals, scores decay the further a person is in relation to the proband based on the formula: ((unaffected.max2) * coefficient.of.relatedness ) * unaffected.weight For example, with the default unaffected.max of 8. The sister that does not have a variant would get a score of ((82) 0.5) * unaffected.weight (16 * 0.5) * 0.5 = 4 points for the variant. If these were the only two relatives considered we could sum the points and get a score in favour of the variant of 8 + 4 = 12 If there is evidence against a variant, this is factored into the score as: total.score = evidence.for - evidence.against For example, if there were also an affected sibling without the variant we would have the score against of: (1/0.5) * 1 = 2 The final score for the variant would then be for - against = total 12 - 2 = 10 Giving a final score of 10 for the variant. Comparing values across variants can be used to rank them based on pedigree-informed levels of variant sharing across affected and unaffected individuals. Increasing the affected.weight relative to the unaffected.weight will make the scores give more weight to the correct/incorrect status of affected individuals. The default is 2:1 weight for affected relative to unaffected, which accounts for the fact that variants are likely to be incompletely penetrant and we therefore want to be more tolerant of unaffected individuals that have a variant rather than affected individuals that do not.
Input:
A list with three components: score, score.for, score.against.
relations <- list("A.c" = c(0, 1, 3, 1), "A.i" = c(3), "U.c" = c(1, 2), "U.i" = c(1)) rv.scores <- calc.rv.score(relations)relations <- list("A.c" = c(0, 1, 3, 1), "A.i" = c(3), "U.c" = c(1, 2), "U.i" = c(1)) rv.scores <- calc.rv.score(relations)
Take the relationship matrix and the encoded statuses of info. For each row, generate the encoded data for scoring.
encode.rows(relation.mat, status.df, ...)encode.rows(relation.mat, status.df, ...)
relation.mat |
The relationship matrix for all pairwise combinations of individuals. |
status.df |
The ID, status, and genotypes for each individual. |
... |
Additional arguments to be passed between methods. |
A dictionary with the per-individual relationship lists. One value for each row of the matrix.
Use the relationship informationfrom the pedigree to estimate of the amount of the genome they have inherited it from a common ancestor without recombination.
ibd(a, b, c, d, n, K, theoretical = TRUE)ibd(a, b, c, d, n, K, theoretical = TRUE)
a |
Count of affected individuals |
b |
Count of obligate carriers |
c |
Count of children of either affecteds or carriers, with no children of their own |
d |
Count of Trees of unaffected individuals - specifically, two sequential generations (i.e. a parent and their offspring) |
n |
Count of the number of second generation progeny in a given tree. |
K |
The estimate of penetrance rate. |
theoretical |
Boolean indicating if the calculation should be theoretical IBD calculation (using only d and k), or if the calculation should use the provided n value. |
Can do this for the total potential relatedness in a pedigree (theoretical=TRUE), or for the actual relatedness across collected samples (theoretical=FALSE). For the theoretical=TRUE case, in the unaffected trees, if we have a sample from the parent, then the offspring do not provide any additional information for a max IBD calculation. This means that K does not scale with n.
For theoretical=FALSE, sometimes we don’t have the healthy parent in an unaffected tree, and only have a child. In this case, the IBD contribution from the child is 1/4, and since they’re unaffected and therefore are a counter-filter, they would contribute 1-1/4 = 3/4 to the total relatedness. Either the parent is a non-obligate carrier, or is a non-carrier. The probability of the children depends on which of those is true, so we have the additional set of terms in the theoretical=FALSE logic.
pi-hat value. The proportion of genome shared between individuals.
ibd <- ibd(3, 1, 5, 2, 1, 0.4576484)ibd <- ibd(3, 1, 5, 2, 1, 0.4576484)
Likelihood function for calculation of Pedigree-based autosomal dominant penetrance value. Formula deployed via optimize so as to determine the optimal value.
penetrance(K, a, b, c, d, n)penetrance(K, a, b, c, d, n)
K |
The range of penetrance values to be explored by the optimization function. |
a |
Count of affected individuals |
b |
Count of obligate carriers |
c |
Count of children of either affecteds or carriers, with no children of their own |
d |
Count of trees of unaffected individuals - specifically, two sequential generations (i.e. a parent and their offspring) |
n |
Count of the number of second generation progeny in a given tree. |
K Pedigree-based estimation of autosomal dominant penetrance rate.
K <- optimize(penetrance, c(0,1), 3, 1, 5, 2, 1, maximum=TRUE)$maxK <- optimize(penetrance, c(0,1), 3, 1, 5, 2, 1, maximum=TRUE)$max
Read in variant and status info for individuals.
read.indiv(fname)read.indiv(fname)
fname |
A file name, expected format of contents is: name status variant MS-5678-1001 A 0/1 |
A data frame.
tsv.name1 <-system.file('extdata/1234_ex2.tsv', package = 'KinformR') id.df1 <- read.indiv(tsv.name1)tsv.name1 <-system.file('extdata/1234_ex2.tsv', package = 'KinformR') id.df1 <- read.indiv(tsv.name1)
Read in the encoded pedigree data file.
read.pedigree(filename)read.pedigree(filename)
filename |
name of the file with the data. |
A data frame containing the encoded pedigree information.
example.pedigree.file <- system.file('extdata/example_pedigree_encoding.tsv', package = 'KinformR') example.pedigree.df <- read.pedigree(example.pedigree.file)example.pedigree.file <- system.file('extdata/example_pedigree_encoding.tsv', package = 'KinformR') example.pedigree.df <- read.pedigree(example.pedigree.file)
Row/column intersections give the degree of relationship for the two individuals. 0 = self, -1 = unrelated.
read.relation.mat(fname)read.relation.mat(fname)
fname |
The file with the relationship matrix information. |
A matrix with the relationships and individual ids as rownames and colnames.
mat.name1 <-system.file('extdata/1234_ex2.mat', package = 'KinformR') mat1 <- read.relation.mat(mat.name1)mat.name1 <-system.file('extdata/1234_ex2.mat', package = 'KinformR') mat1 <- read.relation.mat(mat.name1)
Note - ensure the status in the names match your desired encoding! There are individuals with ambiguous statues, that you may require to be encoded in a specific fashion for you current purposes.
read.var.table(fname)read.var.table(fname)
fname |
A file name, expected format of contents is: #CHROM POS REF ALT MS-5678-1001_A MS-5678-1002_U ... chr3 46203838 G A 0/1 0/0 ... |
A dataframe. Data will be worked into a data frame with format. name status variant MS-5678-1001 A 0/1
ex.infile <-system.file('extdata/example_vcf_extract_5678.tsv', package = 'KinformR') read.var.table(ex.infile)ex.infile <-system.file('extdata/example_vcf_extract_5678.tsv', package = 'KinformR') read.var.table(ex.infile)
Score the pedigrees using the pihat values.
score(pihat)score(pihat)
pihat |
Estimated proportion of genome shared between individuals, from function: ibd. |
The score value.
s.val <- score(12.61)s.val <- score(12.61)
By default all individuals are treated as the reference 'proband' and the given variant's score is calculated based on relationships to all other individuals. e.g. for each row in the relationship matrix. calc.rv.score is run, with the row name indicating the reference individual that the calculation is relative to. Note that the relation.mat can include more individuals than are present within the status.df, but the matrix will be subset to include only those individuals that have status information provided.
score.fam( relation.mat, status.df, affected.weight = 1, unaffected.weight = 0.5, return.sums = FALSE, return.means = TRUE, affected.only = TRUE, max.err = 4 )score.fam( relation.mat, status.df, affected.weight = 1, unaffected.weight = 0.5, return.sums = FALSE, return.means = TRUE, affected.only = TRUE, max.err = 4 )
relation.mat |
A relationship matrix for the family. |
status.df |
A dataframe with the encoded variant/disease status of each individual |
affected.weight |
A coefficient to multiply the calculated A.c and A.i relatedness values by. |
unaffected.weight |
A coefficient to multiply the U.c and U.i relatedness values by. |
return.sums |
Boolean indicating if sum of family variant scores should be returned (default = FALSE). |
return.means |
Boolean indicating if mean of all family variant scores should be returned (default = TRUE). |
affected.only |
Boolean indicating if family score should be calculated using only affected individuals (default = TRUE). |
max.err |
A heuristic cap of the number of incorrect assignments allowed when scoring. When the total number of incorrect (sum of affected and unaffected) is exceeded, the variant's score is set to 0, regardless of the number of points for or against. This simplifies scoring and allows for fast filtering of poor quality variants. Default is 4. |
There are several return options possible.
If affected.only is TRUE, the final scores will be reported for only rows where the reference individual is affected (default = True).
If return.means is TRUE, the average scores for the rows will be reported. (default = TRUE)
If return.sums is True, the sum of the scores for all the rows will be reported. (default = False) NOTE: if affected.only = True, the averages and sums are calculated using only the affected reference individuals.
A labelled vector with names: score, score.for, score.against
mat.name1 <- system.file("extdata/1234_ex2.mat", package = "KinformR") tsv.name1 <- system.file("extdata/1234_ex2.tsv", package = "KinformR") mat.df <- read.relation.mat(mat.name1) ind.df <- read.indiv(tsv.name1) ind.df.status <- score.variant.status(ind.df) score.default <- score.fam(mat.df, ind.df.status)mat.name1 <- system.file("extdata/1234_ex2.mat", package = "KinformR") tsv.name1 <- system.file("extdata/1234_ex2.tsv", package = "KinformR") mat.df <- read.relation.mat(mat.name1) ind.df <- read.indiv(tsv.name1) ind.df.status <- score.variant.status(ind.df) score.default <- score.fam(mat.df, ind.df.status)
Determine a value score of families by comparing their relationship structure. More distant relationships between affecteds (e.g. affected cousins) is more valuable that close relationships (e.g. affected siblings) as there is less IBD and therefore a smaller search space.
score.pedigree(h)score.pedigree(h)
h |
A data frame containing the encoded pedigree information |
Simplifying assumptions:
Autosomal dominant
No ambiguous statuses
No more than two sequential generations of unknown carrier status (non-obligate carrier vs. non-carrier). Generalized support of arbitrary tree structures gets a lot more complicated, especially for the likelihood function.
Exclude big giant trees of unaffecteds - related to above. Will slightly bias the result toward higher penetrance.
Exclude subjects younger than age of onset
A data frame containing the theoretical scoring of the power of a family assuming you were able to collect everyone on the simplified pedigree, as well as a current scoreing, examining only those for whom you currently have DNA.
example.pedigree.file <-system.file('extdata/example_pedigree_encoding.tsv', package = 'KinformR') example.pedigree.df <- read.pedigree(example.pedigree.file) penetrance.df <- score.pedigree(example.pedigree.df)example.pedigree.file <-system.file('extdata/example_pedigree_encoding.tsv', package = 'KinformR') example.pedigree.df <- read.pedigree(example.pedigree.file) penetrance.df <- score.pedigree(example.pedigree.df)
Variants can be encoded as binary (0 or 1, genotypes 0/0 or 0/1, or phased genotypes 0|0 0|1). Note the program assumes alt is the disease allele. homozygous alts are allowed.
score.variant.status(indiv.df, theoretical.max = FALSE)score.variant.status(indiv.df, theoretical.max = FALSE)
indiv.df |
A dataframe with the format: name status variant MS-5678-1001 A 0/1 |
theoretical.max |
Should the theoretical maxima be returned instead of the observed values? When true, the scoring assumes correct variant-status pair for each individual. Default is FALSE. |
theoretical.max - bool, default is FALSE when TRUE, function encodes the theoretical max, using a dummy perfect associatng variant generated to see what a family could score. TODO - switch to numbers 1-4 and -1?
Copy of input dataframe, with dataframe with the status categroies added as a new column "statvar.cat"
Take the matrix and subset out only the encoded individuals that are present in the status dataframe.
## S3 method for class 'mat' subset(mat.df, status.df)## S3 method for class 'mat' subset(mat.df, status.df)
mat.df |
The full matrix file to subset |
status.df |
The list of sampled individuals, matrix is subset to only these individuals. |
A subset of the input matrix.