Who infects whom?—Reconstructing infection chains of Mycobacterium avium ssp. paratuberculosis in an endemically infected dairy herd by use of genomic data

Recent evidence of circulation of multiple strains within herds and mixed infections of cows marks the beginning of a rethink of our knowledge on Mycobacterium avium ssp. paratuberculosis (MAP) epidemiology. Strain typing opens new ways to investigate MAP transmission. This work presents a method for reconstructing infection chains in a setting of endemic Johne’s disease on a well-managed dairy farm. By linking genomic data with demographic field data, strain-specific differences in spreading patterns could be quantified for a densely sampled dairy herd. Mixed infections of dairy cows with MAP are common, and some strains spread more successfully. Infected cows remain susceptible for co-infections with other MAP genotypes. The model suggested that cows acquired infection from 1–4 other cows and spread infection to 0–17 individuals. Reconstructed infection chains supported the hypothesis that high shedding animals that started to shed at an early age and showed a progressive infection pattern represented a greater risk for spreading MAP. Transmission of more than one genotype between animals was recorded. In this farm with a good MAP control management program, adult-to-adult contact was proposed as the most important transmission route to explain the reconstructed networks. For each isolate, at least one more likely ancestor could be inferred. Our study results help to capture underlying transmission processes and to understand the challenges of tracing MAP spread within a herd. Only the combination of precise longitudinal field data and bacterial strain type information made it possible to trace infection in such detail.


Introduction 36
Mycobacterium avium ssp. paratuberculosis (MAP) is the causative agent of Johne's disease, 37 or paratuberculosis, a chronic, slowly progressing disease of ruminants associated with high 38 economic losses, especially in dairy herds. Challenges in the surveillance and control of MAP are a 39 long incubation period of 1-15 years (1), and inefficient diagnostic tests, which lead to limited success 40 of control programmes. The role of MAP in the pathogenesis of Crohn's disease in humans is still 41 controversial (2). 42 The primary route of MAP infection is faecal-oral by direct or indirect contact with the 43 pathogen. Calves are highly susceptible during the first weeks after birth, and resistance to infection 44 increases until one year of age (3). Calves become infected either horizontally or vertically (in utero). 45 Transmission from dams to calves at an early age is currently regarded as the most important route of 46 infection and is therefore the focus of many control programmes. In adults, ingestion of MAP does not 47 necessarily lead to infection, but repeated uptake of high doses of bacilli may result in adult infection 48 (4,5). Adult-to-adult, calf-to-calf and heifer-to-heifer infections have been shown to exist (4,6-9). 49 These routes typically receive little attention in MAP control programmes. 50 Large differences in MAP shedding patterns can be observed. Intermittent shedders, low 51 shedders (≤50 colony-forming units per gram (cfu/g) faecal matter), high shedders (>50-10^4 cfu/g 52 faecal matter), and super-shedders (>10^4 cfu/g of faecal matter) are known shedding categories for 53 individual animals. The majority of cows will never develop high shedding levels, since many cows 54 never reach advanced enough age (10). Schukken  suggesting that they have the infection process under control. Building on these findings, Mitchell et 59 al. distinguished between two categories of progressors, linked to immune control and the age at 60 onset of shedding: cows that start shedding at a younger age partially control the infection, but 61 eventually become high shedders (slow progressive infection), while cows that start shedding 62 persistently at an older age progress rapidly with shedding and lack effective control of infection (10). 63 Obviously, super-shedders represent the greatest risk for spreading MAP among herd mates (11). 64 However, removing high-shedding animals (which are easily detected) has shown to be insufficient to 65 address long-term persistence of MAP (12,13). Simulation models have given further support to the 66 hypothesis that intermittent, low and transiently shedding animals play an important role in 67 maintaining low prevalent infections in dairy herds (14). Quantitative estimates of the importance of 68 transmission routes at all ages of the host and of the role of animals presenting these different 69 shedding patterns are essential to decide on relevant control procedures. 70 The MAP genome is extremely stable with an estimated mutation rate µ of the core genome 71 of one mutation per 2-7 years (15). Earlier literature assumed clonal infections of herds with a single 72 strain of MAP bacteria (16,17). However, several studies now have shown that multiple strains of 73 MAP may be simultaneously present in a herd (7,18), suggesting that several concurrent infection 74 cycles within a single population are possible. A more recent study even demonstrated the incidence 75 of a mixed, simultaneous infection by three genotypically diverse MAP isolates in a single dairy cow 76 (19). Within the individual host, the MAP population is initially thought to be genomically 77 homogeneous, but will diversify over time due to mutations. These processes of within-host evolution 78 of MAP and mixed genotype infection of hosts with multiple MAP strains need to be considered in 79 further studies to draw valid conclusions about the complexities of MAP transmission (20,21). For 80 such studies sequencing a single isolate from each case was suggested to be inadequate in the 81 presence of within-host diversity, but frequent sampling will improve accuracy (22). For MAP it is 82 currently not known whether the low mutation rate will allow detailed analyses of infection chains. 83 With whole genome sequencing (WGS) data the highest possible degree of discrimination between 84 pairs of isolates can be achieved. Nevertheless, integration of non-WGS data into analysis of 85 transmission pathways is suggested to lead to considerable refinement in our understanding of the 86 epidemiology of mycobacterial disease (20). 87 With falling costs of large-scale genome sequencing and advances of biostatistical tools, 88 population genomic studies are increasingly used to study pathogen spread within populations. 89 Traditionally, network inference models were used to identify transmission chains in early stages of 90 disease outbreaks. In endemic settings, network inference faces multiple challenges, such as: (a) 91 non-sampled early generations of cases and thus uncertainty about which of the sampled strains is 92 genomically closest to the originally introduced strain and can thus be considered as the most recent 93 common ancestor; (b) multiple introductions of genomically diverse strains over time, resulting in a 94 polyphyletic sample; and (c) as a consequence of (b), exposure of hosts to multiple strains which may 95 lead to mixed genotype infections. 96 This study aims to identify individual animal-to-animal infection chains ("who infects whom"), 97 in order to better understand the infection dynamics of MAP in endemically infected dairy herds. 98 Transmission trees will be constructed by using WGS data in combination with detailed longitudinal 99 epidemiological data. Support of the reconstructed infection chains for the current prevailing 100 hypotheses on transmission routes will be evaluated, and the role of individual animals in infection 101 spread will be investigated. In addition, within-host and within-herd diversity of MAP will be 102 characterised to provide fundamental input to all advanced analyses. To conclude, it shall be 103 discussed whether observational field data are precise enough to perform relevant analyses to inform 104 future research. 105

Analysis of strain diversity
170 Strain diversity was estimated at cow-level (within-host strain diversity) and at herd-level 171 (within-herd strain diversity). Isolates sampled from the same cow were compared to isolates shed by 172 other cows. As a measure of genomic diversity, pairwise distance was calculated based on the 173 number of SNPs between each pair of isolates. Isolates with zero SNP differences are referred to as 174 isolates with identical genotype; isolates that differ by at least 1 SNP are referred to as different 175 genotypes. MEGA7 was used to estimate the maximum likelihood phylogeny (30). 176 Inference of ancestry follows a strict hierarchy: 1) temporal order: isolates with the earliest 194 dates are at the root, and those with the latest dates are at the tips of the reconstructed tree: we used two different dates: inferred start of shedding and birth date of cow; see "Scenarios", 2) genomic 196 distance: based on the number of SNP differences between pairs of isolates (entered as distance 197 matrix), 3) epidemiological weight: see "scenarios", and 4) probability p of observing a given number 198 of mutations between an isolate and its ancestor: p was computed based on the mutation rate µ of the 199 pathogen (0.25 substitutions/core genome/year), time interval between each pair of isolates, and 200 length of partial nucleotide sequences (1,472 SNPs), using maximum likelihood. As the genomic 201 distance between two isolates a and b increases, p decreases that a is the direct ancestor of b. 202

Reconstruction of transmission trees
SeqTrack only relies on epidemiological weights and p to resolve ties in the choice of ancestry: if an 203 isolate has more than one potential ancestor in identical genomic distance, the ancestor with the 204 higher weight is assigned. If two potential ancestors have the same weight, the ancestor with the 205 higher p value is assigned. The analysis is thus largely insensitive to µ. 206 However, isolate specific parameters were sampling date, pen contacts of the cow at sampling day, 215 and duration of exposure to other MAP-shedding cows. 216

Extension of SeqTrack to endemic infection
Transmission events at isolate level are referred to as ancestries between an ancestor and its 217 descendant; at cow level, the terms source of infection and recipient are used. A source or recipient 218 could be either another cow or an environmental sample. 219

220
If the identical genotype could be isolated multiple times from the same cow, duplicate 221 sequences (n = 22) were discarded. Continuous shedding from the earliest to the last sampled isolate 222 of this genotype was assumed and the infectious period was set as described in the next section.   SeqTrack will define the best fitting transmission tree based on maximum parsimony. The 288 basic model without weights as described in these scenarios will logically provide the maximum 289 parsimony model. The basic model is only based on genomic distances and does not take 290 epidemiological limitations into account. A genomic connection between two isolates from two cows 291 that in real life were never on the farm at the same time is acceptable in the basic model, but will 292 receive a very low weight in the model expanded with epidemiological information. Therefore, 293 maximum parsimony should only be compared within the same scenario or between scenarios when 294 no conflicting epidemiological information is present. 295 Within-herd circulation of genomically diverse strains 296 SeqTrack assumes monophyletic genomic data (infection caused by a single external source) 297 and will add all isolates into one single transmission tree, independent of how distant (and thus less 298 likely) reconstructed ancestries may be. With the low mutation rate of MAP, strain diversification within 299 the study period through evolution was expected to be limited. A genomic distance threshold of 6 SNPs was defined based on the overall herd-level genomic diversity (Fig 2). Pairs of isolates 301 exceeding this threshold were considered not to have arisen from directly linked cases. If no ancestor 302 within this threshold could be found in the sample for a particular isolate, it was set as the root of a 303 separate transmission tree, indicating that the true ancestor had not been sampled. Generally, the 304 lower a threshold of number of SNPs is chosen, the more transmission trees with multiple generations 305 will be broken up into smaller individual trees or unconnected singleton isolates (resulting trees of a 306 sensitivity analysis with different threshold values are not shown). 307 true ancestor in the sample, SeqTrack assigns more descendants to the earliest sampled isolates. In 313 addition, data were right censored, and the number of descendants were underestimated, particularly 314 for isolates sampled in the late phase of the study. Whereas the transmission trees were 315 reconstructed with all 128 isolates, the following conservative assumptions were made for the 316 estimation of genotype-specific and cow-specific reproduction ratios to account for temporality in the 317 data structure: For the earliest 10% of isolates only one third of the assigned descendants were 318 assumed to be their true descendants. For the latest 10% of isolates the number of descendants per 319 isolate were not calculated as these descendants were not yet fully sampled within the study period. 320 Consequently, the role of individual cows in infection spread was analysed for the remaining 84 321 isolates, sampled from 57 cows. 322

323
If isolates with identical genotype can be sampled over years from generations of cows, a 324 range of alternative infection chains may exist (Fig 3). In addition to the number of recipients 325 each cow, assuming that all isolates with identical genotype that fulfilled certain criteria could be 327 descendants of the same cow. The criteria were those for [S] weights 4-6, as described under 328 "Scenarios". 329 Strain diversity within-herd (at herd-level) 374 A total of 94 different genotypes were recovered from all 150 sequenced isolates (Fig 4); of 375 these 84 (89%) were only detected once. The most prevalent genotype was isolated 32 times from 376 twelve cows and four environmental samples. Several MAP strains with genomic distances of more 377 than 100 and 200 SNPs between strains were recorded (Fig 2).   Thirty-eight (30%) isolates were assigned the identical ancestor in all six scenarios, and for 6 415 (5%), 18 (14%) and 49 (38%) isolates the algorithm returned five, four or three times with an identical 416 ancestor, respectively. No isolate was assigned a different ancestor in every scenario, and for each 417 isolate, at least one ancestral isolate could be identified with more statistical and/or epidemiological 418 support.  Table). 431 Adult-to-adult contact during the infectious period was in all scenarios except [birth_Basic] the 432 most frequent social network pattern leading to infection, followed by direct contact during adulthood 433 outside the infectious period and indirect contact (Table 1)     where weighting preference was given to increased susceptibility in young animals. 522

Role of individuals in MAP spread
Concurrent circulation of dominant strains could be recorded over several years, indicating 523 that some strains were more successful in terms of transmission and infection progression (7,8). 524 Other important features of transmission trees were some minor strains that could only be recorded 525 over 2-4 generations of transmissions, and singleton isolates not related to any other isolate. 526 Transmission studies assuming spread of a monophyletic strain will certainly underestimate the 527 complexity of multiple infection chains occurring in parallel. 528 A particular challenge was inconclusive ancestries for clusters of isolates with identical 529 genotype within the dominant strains. In situations with simultaneous exposure of a susceptible 530 individual to multiple shedders of the same genotype, neither genomic distance nor contact data can 531 be used to resolve ties in ancestries. However, the critical question is: is it relevant to know whether 532 cow X or cow Y infected cow Z, given that large parts of the operation were perpetually contaminated 533 with one strain? Management-wise, a holistic control strategy would be required, as removing single 534 known shedders could result in limited success in interrupting the infection cycle. 535

536
Mixed infections are common. This finding adds complexity to the estimation of standard 537 measures in epidemiology, such as the effective reproduction ratio R. Despite the decreasing 538 prevalence during the study period, several scenarios resulted in an animal-specific R A >1. An 539 explanation for this contradiction is that R A does not consider that some cows acquire infection more 540 than once whereas the proportion of cows that apparently remained MAP negative gradually 541 increased over time. Dependent on this clustering of co-infections with different MAP genotypes, R A 542 will be considerably larger than the mean number of newly infected recipients generated in the herd 543 (R H ) from one generation to the next in the infection chain. This leads to two conclusions: First, 544 infected cows remain susceptible to co-infections with other genotypes. This finding will be relevant 545 for mathematical models: transition between infection states might not be as clear-cut as often 546 assumed. Second, R A >R H indicates that some cows have a lower risk of becoming infected than 547 others. How much of these differences can be explained by varying exposure intensity, susceptibility 548 or an alternative explanation was beyond the scope of this study and remains to be investigated. 549  8), who showed that cows that were infected with a particular MAP strain were significantly more exposed as adults to cows shedding the same strain compared to cows that were culture negative for 576 MAP at slaughter. Cow-to-calf and calf-to-calf contacts during early life accounted for less than 1 in 12 577 transmissions. These analyses thus did not support the hypothesis that dam-daughter infections were 578 the principal transmission route on the farm under investigation. This farm had implemented rigid 579 interventions for MAP control, daughters were separated from dams as quickly as possible and 580 contact between age groups was limited. This could explain the minor role of dam-newborn calf and 581 calf-to-calf transmissions and indicate success of the implemented interventions. However, MAP 582 exposure during adulthood appeared to be sufficient to maintain infection over years. To confirm this 583 hypothesis, the analysis needs to be repeated for farms with different calf management protocols to 584 investigate correlation between interventions and contribution of transmission routes for long-term 585 persistence of MAP. 586

587
In the presence of within-host diversity, analysis of a single isolate will miss clonal, closely 588 Only recently, have methods been published which allow the integration of additional 608 epidemiological information (47-49). These authors highlight the need to expand outbreak 609 reconstruction tools to utilize other types of epidemiological data due to limited information value of 610 sequence data (48), or even conclude for certain outbreaks that contact data is equally or more 611 informative than sequencing data (47). 612 Our study objectives required an algorithm that supported incorporation of a much broader 613 collection of detailed epidemiological field data. SeqTrack proved to be a suitable backbone, versatile 614 enough to be expanded for investigations of endemic disease. The algorithm has limitations regarding 615 inference of the underlying transmission process and timing of exposure. However, SeqTrack 616 supports incorporation of all kinds of data to describe proximity in the form of weighting matrices, a 617 criterium we prioritized in the assessment of available methods to apply in our study as strain 618 diversification was expected to be limited. Weighting by exposure time [E] was based solely on 619 explicit data and weighing by susceptibility [S] was based on accepted knowledge of MAP 620 epidemiology. Of note, scenarios informed by data and knowledge resulted in more consistent 621 ancestries compared to scenarios based solely on genomic data. On purpose, no vague assumptions 622 about transmission routes were made, to enable independent evaluation of reconstructed networks 623 for transmission routes. To allow for ancestries with short time-intervals between two transmission 624 events, no minimum duration of pre-infectious period was entered. Hence, transmission routes for 625 scenarios in which time between MAP uptake and shedding is assumed to be short, such as calf-to-626 calf transmission or pass-through shedding of adults, could be investigated. Testing for correlation 627 between phenotypes and number of recipients in reconstructed networks demonstrated that the 628 results produced with this method are consistent with the literature. Data on a disease phenotype 629 were thus utilisable for validation and gave further support to the accuracy of the inferred ancestries.
The authors are confident that this method will also be valuable for estimation of strain-specific 631 transmission parameters. 632 An apparent limitation of SeqTrack is its strong dependence on temporal ordering of isolates. 633 The algorithm does not clearly account for the potentially long delay between unobserved exposure 634 and observed sampling events. Due to the long incubation period of MAP, the order of ancestry 635 cannot be inferred with certainty. Other approaches such as Bayesian algorithms will be superior in In this field, MAP poses great challenges as its detection and cultivation are difficult and time-647 consuming. For this study, although one of the most detailed MAP cohorts published to date, ideally 648 (at least) one more generation of cows should have been sampled to reduce the impact of both right-649 and left-censored data on the results. As MAP prevalence decreases, sampling effort increases even 650 more compared to each WGS sequence gained in return. In this study, on average 33 faecal, tissue 651 or environmental samples were cultured per sequenced isolate; towards the end of the study this 652 number increased towards 100 samples for one recovered sequence. Particularly for studying low 653 prevalent MAP, field studies will quickly reach their resource limit (in the absence of diagnostic 654 methods to identify MAP infected individuals more efficiently). However, only field data can show the 655 true variation in incidence, change of prevalence and effects of control interventions on transmission 656 in a non-controlled environment. This information underlies the basis for forming hypotheses to be followed up with experimental studies in animal models under controlled conditions and for 658 parameterization of modeling exercises to complement field studies. 659

660
Although all cows were sampled periodically, the within-herd MAP population was only 661 incompletely sampled. Low test sensitivity will have led to unobserved infected cows or additional 662 MAP genotypes shed by known MAP-shedders. Since infection source is only attributed to sampled 663 isolates, adding such non-detected isolates to the analysis could either result in new common sources 664 of infection for separate transmission trees or cause additional generations between (wrongly) 665 inferred direct ancestries within a transmission tree. Nevertheless, the original isolates would remain 666 in the same transmission tree. The reproductive ratios at genotype level R GT would have been higher; 667 the above presented values thus represent estimates at the low end. With more cows potentially 668 contributing to MAP spread, the R A would have varied for individual cows. Assuming that all 669 genotypes were affected to a similar extent by the non-perfect test characteristics, the authors believe 670 that the main conclusions of this work remain valid. 671 Genotypes isolated from the environment could be simultaneously detected in individual cow 672 samples in only 9% of cases, and for a small proportion of ancestries no epidemiological link was 673 recorded at all, indicating unobserved cases. The observed strain diversity in tissues sampled in 674 parallel, highlights the value of sequencing more than one individual colony from a sample to 675 understand within-host diversity and to infer ancestries with more certainty. An optimized combination 676 of sampling and diagnostics is needed to capture genomic diversity with minimized sampling effort. 677 As youngstock were not sampled, no genomic data to directly confirm calf-to-calf or adult-to-678 calf transmission were available. From the data it could not be determined whether genotypes that led 679 to the initial infection were among the available collection of isolates from a cow. In addition, temporal 680 order of isolation was not necessarily in the order of genomic evolution: mutated genotypes might 681 have been sampled before their ancestor. The sensitivity of SeqTrack to time was taken into account 682 by calculating scenarios with two different time indicators: it could be shown that the order of isolates 683 benefit from strain-specific transmission parameterisation. To be able to observe the full range of 710 diversity in samples with heterogeneous MAP populations, methods for pathogen isolation are 711 needed which support detection and quantification of multiple genotypes. This work presents a 712 method for reconstructing "who infects whom" based on genomic data with greater epidemiological 713 and statistical support. Reconstructed infection chains confirmed high shedding and exposure to 714 shedders over longer time periods as risk factors for MAP transmission on the investigated dairy farm. 715 We believe that the method will be useful for further studies on the relevance of transmission routes 716 and role of individuals expressing distinct disease phenotypes in infection dynamics of endemic 717 disease. 718 WGS is invaluable in studying pathogen transmission, both with outbreaks and in endemic 719 settings. However, WGS is not a solution for low test sensitivity which leads to non-observed isolates. 720 In addition, especially for MAP-like pathogens the question remains, when does an animal become 721 infected and infectious? The method presented in this work is able to indicate where infection cycles 722 went undetected. This information can be used to adapt sampling to better capture underlying 723 transmission processes. Knowledge of pathogen biology and availability of precise longitudinal data 724 are crucial to maximise benefits of WGS and validly reconstruct infection chains. 725