Samples

In order to obtain representative data from across the whole human lifespan, samples were obtained from three sources: (1) Stem Cell Technologies provided frozen mononuclear cells (MNCs) from two cord blood samples that had been collected with informed consent, including for whole-genome sequencing (catalogue no. 70007). (2) Cambridge Blood and Stem Cell Biobank (CBSB) provided fresh peripheral blood samples taken with informed consent from two patients at Addenbrooke’s Hospital (NHS Cambridgeshire 4 Research Ethics Committee reference 07/MRE05/44 for samples collected before November 2019 and Cambridge East Ethics Committee reference 18/EE/0199 for samples collected from November 2019 onwards. (3) Cambridge Biorepository for Translational Medicine (CBTM) provided frozen bone marrow with or without peripheral blood MNCs taken with informed consent from six deceased organ donors. Samples were collected at the time of abdominal organ collection (Cambridgeshire 4 Research Ethics Committee reference 15/EE/0152). All samples were collected under the approved studies quoted, with informed consent to use the materials for the research undertaken here and publish the results. Details of the individuals studied and the samples they provided are illustrated in Fig. 1b, with additional information in Supplementary Table 1.

No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.

Isolation of MNCs from peripheral blood

MNCs were isolated using Lymphoprep density gradient centrifugation (STEMCELL Technologies), after diluting whole blood 1:1 with PBS. The red blood cell and granulocyte fraction of the blood was then removed. The MNC fraction underwent red cell lysis using 1 incubation at 4 °C for 15 min with RBC lysis buffer (BioLegend). CD34 positive cell selection of peripheral blood and cord blood MNC samples was undertaken using the EasySep human whole blood CD34 positive selection kit (STEMCELL Technologies). The kit was used per the manufacturer’s instructions, but with only a single round of magnetic selection. Bone marrow MNCs did not undergo CD34 positive selection prior to cell sorting.

Fluorescence-activated cell sorting

MNC or CD34 enriched samples were centrifuged and resuspended in PBS/3%FBS containing an antibody panel consisting of: CD3/FITC, CD90/PE, CD49f/PECy5, CD38/PECy7, CD33/APC, CD19/A700, CD34/APCCy7, CD45RA/BV421 and Zombie/Aqua. Cells were stained (30 min at 4 °C) in the dark before washing and resuspension in PBS/3%FBS for cell sorting. For all samples, HSC/MPP pool cells (LinCD34+CD38CD45RA) were sorted using either a BD Aria III or BD Aria Fusion cell sorter (BD Biosciences) at the NIHR Cambridge BRC Cell Phenotyping hub. The gating strategy is illustrated in Extended Data Fig. 1b. The HSC/MPP population was treated as a single entity and not further subclassified in the analysis. The immunophenotypic HSC/MPP population includes both long-term, intermediate-term and short-term HSCs as well as multipotent progenitors, as demonstrated functionally in xenotransplantation assays51,52,53,54.

In a subset of individuals, a small number of HPCs (LinCD34+CD38+) were also sorted. The antibody panel used is shown in Supplementary Table 2. The HPC cells were treated as a single entity in the analysis and not further subclassified. The immunophenotypic HPC compartment includes predominantly myeloid and erythroid progenitors.

Single-cell colony expansion in vitro

Single phenotypic HSC/MPP or HPC cells were index sorted, as above, into Nunc 96 flat-bottomed tissue culture plates (Thermofisher), containing 100 μl supplemented StemPro medium (Stem Cell Technologies) but no mouse cell feeder layer. The following supplements were added to promote proliferation and push differentiation toward granulocyte, monocyte, erythroid and natural killer cell types: StemPro Nutrients (0.035%, Stem Cell Technologies), l-glutamine (1%, ThermoFisher), penicillin-streptomycin (1%, ThermoFisher) and cytokines (SCF, 100 ng ml−1; FLT3, 20 ng ml−1; TPO, 100 ng ml−1; EPO, 3 ng ml−1; IL-6, 50 ng ml−1; IL-3, 10 ng ml−1; IL-11, 50 ng ml−1; GM-CSF, 20 ng ml−1; IL-2, 10 ng ml−1; IL-7, 20 ng ml−1; lipids, 50 ng ml−1). Cells were incubated at 37 °C and the colonies that formed were topped up with 50 μl StemPro medium plus supplements at 14 ± 2 days as necessary. At 21 ± 2 days, a visual size assessment of colonies was undertaken prior to collection of cells for DNA extraction. Larger colonies (≥3,000 cells in size) were transferred to fresh U bottomed 96 well plate (Thermofisher). The U bottomed plates were then centrifuged (500g for 5 min), medium was discarded, and the cells were resuspended in 50μl PBS prior to freezing at −80 °C. Smaller colonies (less than 3,000 but more than 200 cells in size) were collected into 96-well skirted LoBind plates (Eppendorf) and centrifuged (800g for 5 min). Supernatant was removed to 5–10 μl using an aspirator prior to DNA extraction on the fresh pellet. For larger colonies DNA extraction was performed using the DNeasy 96 blood and tissue plate kit (Qiagen). The Arcturus Picopure DNA Extraction kit (ThermoFisher) was used to extract DNA from the smaller colonies. Both kits were used per the manufacturer’s instructions. Overall, 42–89% of the sorted HSC/MPP population in each individual produced a colony. This is much more efficient than previously used methods of colonygrowth in semi-solid media, where it is has been estimated that 10% of single HSC/MPPs will produce a colony. In addition, analysis of cell surface markers showed there was no immunophenotypic difference between sorted HSC/MPPs that formed colonies and were sequenced, compared to those that did not (Extended Data Fig. 1c, d).

Whole-genome sequencing of colonies

A recently developed low-input enzymatic fragmentation-based library preparation method45,55 was used to generate whole-genome sequencing libraries from 1–5 ng extracted DNA from each colony. Whole-genome sequencing was performed at a mean sequencing coverage of 14× (8–35×) on either the Hiseq X or the NovaSeq platforms (Illumina). BWA mem was used to align 150 bp paired end reads generated to the human reference genome (NCBI build 37; GRCh37d5).

Variant calling

CaVEMan (used for calling SNVs) and Pindel (used for calling small indels) were run against an unmatched synthetic normal genome using in-house pipelines45,56,57,58. Outputs were then filtered using several strategies to exclude library, sequencing and mapping artefacts. Structural variants were called using GRIDDS59, with all variants confirmed by visual inspection and by checking if they fit the distribution expected based on the SNV-derived phylogenetic tree. Autosomal CNAs and X chromosome CNAs in females were called using allele-specific copy number analysis of tumours60 (ASCAT), which was run against a single sample selected from each individual. The matched sample was selected to have a coverage > 15×, no loss of Y and to be a singleton in the phylogenetic tree (no coalescences post-birth). The ASCAT output was manually interpreted through visual inspection. ASCAT was unable to accurately call copy number changes on the haploid sex chromosomes in males. Therefore, we also ran the breakpoint analysis algorithm BRASS61 to generate an intermediate file containing information on binned read counts across 500-bp segments of the genome. A comparison of the mean coverage of the X and Y chromosomes was used to call X or Y CNAs in individual samples, which were then validated by visual inspection of read depth.

Full details of the variant calling strategies and filtering approaches used are described in the Supplementary Information.

Filtering at the colony level

As outlined in the main text, we removed some colonies from the dataset due to low coverage (17 samples), being technical duplicates (34 samples) and for showing evidence of non-clonality or contamination (7 samples). We used a peak VAF threshold of < 0.4 (after the removal of in vitro variants) as well as visual inspection of the VAF distribution plots to identify colonies with evidence of non-clonality (Extended Data Fig. 2a–c). Visual inspection was particularly important in the cord blood samples where there was greater variability in the distribution of variant allele frequencies due to the lower mutation burden. In these samples the VAF threshold of < 0.4 was therefore less stringently applied.

Validation of mutation calls

Mutation spectrums were compared between the set of shared mutations (those present in two or more colonies), which are those we have the greatest confidence in, and private mutations (present in only one sample), in which we have lower confidence (Extended Data Fig. 2e). The mutation spectrums are almost identical, providing evidence that the private mutation set does not contain excess artefacts.

Mutation burden analysis

SNV and indel burden analysis was performed by first correcting the mutation and indel burden to a sequencing depth of 30, by fitting an asymptotic regression to the data (function NLSstAsymptotic, R package stats) (Extended Data Fig. 2d). Subsequently, linear mixed effects models were used to test for a linear relationship between age and number of SNVs or number of indels (function lmer, R package lme4). Number of mutations or indels per colony was regressed using log-likelihood maximization and age as a fixed effect, with the interaction between age and donor as a random effect. Progenitor samples were excluded from this analysis.

We also performed linear regression of age and mutation burden as above broken down by age range. Using data from the two cord blood donors and the youngest adult age 29 gives a rate of mutation accumulation of 17.56 per year (95% confidence interval 17.32–17.78). Using data from just the 3 younger donors aged 29, 38, 48 and 63 gives a rate of mutation accumulation of 17.21 per year (95% confidence interval 16.12–18.3). Linear regression of age and mutation burden using the 5 older donors aged 63, 75, 76, 77 and 81 gives a rate of mutation accumulation of 18.84 per (95% confidence interval 16.82–20.86). These results are very consistent over different phases of life.

Telomere analysis

Telomere length for each colony sequenced on the Hiseq platform (corresponding to the telomere length in the founding HSC/MPP or HPC) was estimated from the ratio of telomeric to sub-telomeric reads using the algorithm Telomerecat62. Colonies sequenced on the Novaseq platform could not be used as telomeric reads were removed by a quality control step prior to bam file creation.

Linear mixed effects models were used to test for a linear relationship between age and telomere length across all the adults. Cord blood samples were excluded due to possible non-linearity of the relationship between age and telomere length in very early life63.

Normality testing of the telomere length distributions was performed in R using the Shapiro-Wilk normality test (function shapiro.test) and visualized using Q–Q plots and density plots (functions ggqqplot and ggdensity). The percentage of outlying HSC/MPPs per individual was calculated using the interquartile range criterion (all samples outside the interval [q0.25 − 1.5 × IQR, q0.75 + 1.5 × IQR] are considered as outliers; IQR is the interquartile range; q0.25 and q0.75 are the 25th and 75th centiles respectively) (function boxplot.stats). For all individuals, the only outliers in the data had longer than expected rather than shorter than expected telomeres.

Construction of phylogenetic trees

The key steps to generate the phylogenies shown in Figs. 2, 3 and Extended Data Fig. 5 are as follows:

  1. 1.

    Generate a ‘genotype matrix’ of mutation calls for every colony within a donor. Our protocol, based on whole-genome sequencing of single cell-derived colonies, generates consistent and even coverage across the genome, leading to very few missing values within this matrix (ranging from 0.005 to 0.034 of mutated sites in a given colony across different donors within our cohort). This generates a high degree of accuracy in the constructed trees.

  2. 2.

    Reconstruct phylogenetic trees from the genotype matrix. This is a standard and well-studied problem in phylogenetics. The low fraction of the genome that is mutated in a given colony (<1 per million bases) coupled with the highly complete genotype matrix mean that different phylogenetics methods produce reassuringly concordant trees. We used the MPBoot algorithm64 for the tree reconstruction, as it proved both accurate and computationally efficient for our dataset.

  3. 3.

    Correct terminal branch lengths for sensitivity to detect mutations in each colony. The trees generated in the previous step have branch lengths proportional to the number of mutations assigned to each branch. For the terminal branches, which contain mutations unique to that colony, variable sequencing depth can underestimate the true numbers of unique mutations, so we correct these branch lengths for the estimated sensitivity to detect mutations based on genome coverage.

  4. 4.

    Make phylogenetic trees ultrametric. After step 3, there is little more than Poisson variation in corrected mutation burden among colonies from a given donor. Since these colonies all derived from the same timepoint, we can normalize the branch lengths to have the same overall distance from root to tip (known as an ultrametric tree). We used an ‘iteratively reweighted means’ algorithm for this purpose.

  5. 5.

    Scale trees to chronological age. Since mutation rate is constant across the human lifespan, we can use it as a molecular clock to linearly scale the ultrametric tree to chronological age.

  6. 6.

    Overlay phenotypic and genotypic information on the tree. The tip of each branch in the resulting phylogenetic tree represents a specific colony in the dataset, meaning that we can depict phenotypic information about each colony underneath its terminal branch (the coloured stripes along the bottom of Figs. 2, 3). Furthermore, every mutation in the dataset is confidently assigned to a specific branch in the phylogenetic tree. This means that we can highlight branches on which specific genetic events occurred (such as DNMT3A or other driver mutations).

A complete explanation of the steps undertaken in each of these stages, comparisons of different methods available and notes on the different approaches to validate the phylogenetic trees are available in the Supplementary Information.

Inferring HSC population size trajectory

The R package phylodyn65 provides a well-established approach to inferring historic population size trajectories from the pattern of coalescent events (more specifically the density of these events in historic time blocks) in a phylogenetic tree created from a random sample of individuals in the population. Its use has been pioneered in pathogen epidemiology65 and has also been previously applied to HSPC data from a single individual66. Extended Data Figs. 6b, 7b show how phylodyn can accurately recover simulated population trajectories using sample sizes similar to those we have used and illustrates how the number of coalescent events in a given time window in the tree informs on population size through time (assuming a constant rate of HSC symmetric cell division and a neutrally evolving population). We used phylodyn to infer historic changes in LT-HSC from the ultrametric phylogenies of the four youngest adults in the cohort (Fig. 4a).

Simulations of complete HSC populations from conception to the age of sampling were performed for each individual using the R package rsimpop67 (https://github.com/NickWilliamsSanger/rsimpop). Rsimpop utilizes a birth-to-death model with specified somatic mutation accumulation rate and symmetric cell division rate, to simulate a complete HSC population. Each cell within the population has a rate of symmetric division and a rate of symmetric differentiation (or death). Asymmetric divisions do not impact on the HSC phylogeny and are not accounted for in the model. Full details are available in the Supplementary Information.

Approximate Bayesian computation

We evaluated several ABC models of population dynamics, including both models of selectively neutral genetic drift within the HSC compartment (assessing the effects of changes in population size on phylogenetic trees) and models of positive selection among the HSCs. An additional motivation for performing the Bayesian inferences on neutral models was to enable us to perform posterior predictive checks (PPC), with the aim of deciding whether the observed phylogenies are compatible with neutral models. Note that a separate donor-specific posterior distribution was generated (sampled) for each donor (donor-specific ABC), and a separate donor-specific posterior predictive p-value was computed for each donor (donor-specific PPC). Each donor-specific ABC for the neutral model was performed using the ABC rejection method (R package abc)68,69,70. Further details, together with detailed mathematical exposition, are provided in Supplementary Information.

We used the population trajectory from phylodyn to identify the time period prior to the increase related to a ST-HSC/MPP contribution, and the timing of the midlife and late-life fold-change in (Fig. 4a, Supplementary Fig. 12). We used our data to inform our choices for the time between symmetric cell divisions, which was set at one year (after the initial population growth phase in the first few years of life). We set the rate of mutation accumulation at 15 mutations per year with an additional 1 mutation for every cell division (both of these were drawn from a Poisson distribution centred on the input value).

Analysis of driver variants

Variants identified were annotated with Variation Annotation Generator (VAGrENT) (https://github.com/cancerit/VAGrENT) to identify protein coding mutations and putative driver mutations in each dataset71,72,73,74,75,76,77. Supplementary Table 4 lists the 17 genes we have used as our top clonal haematopoiesis genes (those identified by Fabre et al. as being under positive selection in a targeted sequencing dataset of 385 older individuals78), whose ‘oncogenic’ and ‘possible oncogenic’ mutations (as assessed independently by E.M. and P.J.C.) are shown in Figs. 2, 3 and Extended Data Fig. 5. In the same table, we also list a more extensive list of 92 clonal haematopoiesis genes derived from the literature37, 71, 72 that were interrogated. In order to explore a wider set of cancer gene mutations we used the 723 genes listed in Cosmic’s Cancer Gene Census (https://cancer.sanger.ac.uk/census).

dN/dS analysis

We used the R package dndscv37 (https://github.com/im3sanger/dndscv) to look for evidence of positive selection in our dataset (https://github.com/emily-mitchell/normal_haematopoiesis/5_dNdS/scripts/all_DNDScv_final.Rmd). The dndscv package compares the observed ratio of missense, truncating and nonsense to synonymous mutations with that expected under a neutral model. It incorporates information on the background mutation rate of each gene and uses trinucleotide-context substitution matrices. The approach provides a global estimate of selection in the coding variant dataset (Supplementary Table 7), from which the number of excess protein coding, or driver mutations can be estimated. In addition, it identifies specific genes that are under significant positive selection. Further details are provided in Supplementary Information.

Chromosome Y loss analysis

We observe a series of phylogenetic trees from male individuals in which some clades have lost the Y chromosome. By eye, these clades seem to be larger than clades that have not lost Y. To test this formally, we use a randomization or Monte Carlo test to define the null expected distribution of clade size. For each Monte Carlo iteration, we draw branches of the phylogenetic tree at random—one random branch for each observed instance of Y loss. These branches are sampled (with replacement) from the set of all extant branches at the matched time-point in that individual, and the eventual clade size of that draw measured. For each simulation, the geometric mean (to allow for the log-normality of observed clade sizes) of clade sizes is calculated. We can then compare the distribution of geometric means from the Monte Carlo draws with the observed geometric mean.

Analysis of AML

Mutations in ZNF318 and HIST2H3D were identified in recently described tumour whole-genome sequencing data from 263 patients with AML and myelodysplastic syndromes seen at Washington University School of Medicine in St Louis38. Sequencing, initial processing using the hg38 human reference genome, and full variant calling details for this dataset were described previously38. Briefly, identification of SNVs and indels in ZNF318 and HIST2H3D was performed with Varscan2 run in SNV and indel mode using custom parameters to enhance sensitivity (–min-reads2 = 3, –min-coverage = 6, –min-var-freq = 0.02, and –p-value = 0.01), along with indel callers Pindel and Manta using default parameters. Variant calls identified via these approaches were merged and harmonized using a custom python script and annotated with VEP using Ensembl version 90. Only one potentially pathogenic variant was found in ZNF318. All identified variants are listed in Supplementary Table 8, the majority are likely to be rare germline variants as the sequencing strategy did not incorporate a matched normal sample.

In addition, there were no variants in ZNF318 or HIST2H3D reported in 200 cases of AML in the TCGA dataset79, compared to 51 cases with DNMT3A mutations. Similarly of 71 AML cases in the JAMA study39 23 cases had DNMT3A mutations but none had ZNF318 or HIST2H3D mutations. Both studies used a tumour/normal WGS approach and reported all somatic variants identified as part of their supplementary datasets.

Software

Programs and software used in the data analyses: R: version 3.6.1, BWA-MEM: version 0.7.17 (https://sourceforge.net/projects/bio-bwa/), cgpCaVEMan: version 1.11.2/1.13.14/1.14.1 (https://github.com/cancerit/CaVEMan), cgpPindel: version 2.2.5/3.2.0/3.3.0 (https://github.com/cancerit/cgpPindel), Brass: version 6.1.2/6.2.0/6.3.0/6.3.4 (https://github.com/cancerit/BRASS), ASCAT NGS: version 4.2.1/4.3.3 (https://github.com/cancerit/ascatNgs), VAGrENT: version 3.5.2/3.6.0/3.6.1 (https://github.com/cancerit/VAGrENT), GRIDSS: version 2.9.4 (https://github.com/PapenfussLab/gridss), MPBoot: version 1.1.0 (https://github.com/diepthihoang/mpboot), cgpVAF: version 2.4.0 (https://github.com/cancerit/vafCorrect), dNdScv: version 0.0.1 (https://github.com/im3sanger/dndscv), Telomerecat: version 3.4.0 (https://github.com/jhrf/telomerecat), Rsimpop: version 2.0.4 (https://github.com/NickWilliamsSanger/rsimpop), Phylodyn: version 0.9.02 (https://github.com/mdkarcher/phylodyn) and FlowJo: version 10.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.



Source link