Project Euler 352
Project Euler 352
题目
Blood tests
Each one of the \(25\) sheep in a flock must be tested for a rare virus, known to affect \(2\%\) of the sheep population. An accurate and extremely sensitive PCR test exists for blood samples, producing a clear positive / negative result, but it is very time-consuming and expensive.
Because of the high cost, the vet-in-charge suggests that instead of performing \(25\) separate tests, the following procedure can be used instead:
The sheep are split into \(5\) groups of \(5\) sheep in each group. For each group, the \(5\) samples are mixed together and a single test is performed. Then,
- If the result is negative, all the sheep in that group are deemed to be virus-free.
- If the result is positive, \(5\) additional tests will be performed (a separate test for each animal) to determine the affected individual(s).
Since the probability of infection for any specific animal is only 0.02, the first test (on the pooled samples) for each group will be:
- Negative (and no more tests needed) with probability \(0.98^5 = 0.9039207968\).
- Positive (\(5\) additional tests needed) with probability \(1 - 0.9039207968 = 0.0960792032\).
Thus, the expected number of tests for each group is \(1 + 0.0960792032 \times 5 = 1.480396016\).
Consequently, all \(5\) groups can be screened using an average of only \(1.480396016 \times 5 = \mathbf{7.40198008}\) tests, which represents a huge saving of more than \(70\%\) !
Although the scheme we have just described seems to be very efficient, it can still be improved considerably (always assuming that the test is sufficiently sensitive and that there are no adverse effects caused by mixing different samples). E.g.:
- We may start by running a test on a mixture of all the \(25\) samples. It can be verified that in about \(60.35\%\) of the cases this test will be negative, thus no more tests will be needed. Further testing will only be required for the remaining \(39.65\%\) of the cases.
- If we know that at least one animal in a group of \(5\) is infected and the first \(4\) individual tests come out negative, there is no need to run a test on the fifth animal (we know that it must be infected).
- We can try a different number of groups / different number of animals in each group, adjusting those numbers at each level so that the total expected number of tests will be minimised.
To simplify the very wide range of possibilities, there is one restriction we place when devising the most cost-efficient testing scheme: whenever we start with a mixed sample, all the sheep contributing to that sample must be fully screened (i.e. a verdict of infected / virus-free must be reached for all of them) before we start examining any other animals. For the current example, it turns out that the most cost-efficient testing scheme (we’ll call it the optimal strategy) requires an average of just \(\mathbf{4.155452}\) tests!
Using the optimal strategy, let \(T(s,p)\) represent the average number of tests needed to screen a flock of s sheep for a virus having probability \(p\) to be present in any individual.
Thus, rounded to six decimal places, \(T(25, 0.02) = 4.155452\) and \(T(25, 0.10) = 12.702124\).
Find \(\sum T(10000, p)\) for \(p=0.01, 0.02, 0.03, \dots 0.50\).
Give your answer rounded to six decimal places.
解决方案
每次做一次 PCR 检测,结果只有 阴性/阳性 两种。题目额外限制是:一旦你对某个混样做了检测,那么参与该混样的所有羊必须先被完全判定(感染/未感染),之后才能开始检查其它羊。这条限制非常关键:它强迫策略呈现“深度优先的递归结构”:
- 你先从未确定的羊里选出一组(大小为 \(k\))做混检;
- 若阴性,则这 \(k\) 只羊全部判为健康,接着处理剩下的羊;
- 若阳性,则你必须把这 \(k\) 只羊完全查清(可以继续对子集混检、或单检),查清后才允许回到剩余羊群继续。
因为每只羊独立同分布、且没有个体差异,所以一个子问题只需要由两件事决定:
- 羊的数量 \(n\);
- 我们是否“已知至少有一只感染”。
于是自然得到两个 DP 状态。设单只羊感染概率为 \(p\),令 \(q = 1-p\)。定义混检阳性的概率(至少一只感染)为\(P(n)=1-q^n=1-(1-p)^n.\)定义:
- \(F(n)\):对 \(n\) 只羊进行完全筛查的最小期望检测次数,在开始时并不知道这 \(n\) 只里是否有人感染。
- \(G(n)\):对 \(n\) 只羊进行完全筛查的最小期望检测次数,但已知这 \(n\) 只里至少一只感染。
首先推导\(F(n)\)的转移。对一组“冷”的 \(n\) 只羊,第一步选 \(k\) 只做混检(\(1\le k\le n\)),先消耗 \(1\) 次检测。
- 阴性概率是 \(q^k\),则这 \(k\) 只全部判健康,剩下 \(n-k\) 只仍是冷问题,代价 \(F(n-k)\)。
- 阳性概率是 \(P(k)=1-q^k\),则这 \(k\) 只变成“已知至少一只感染”的 warm 问题,代价 \(G(k)\);之后再做剩余 \(n-k\) 只的冷问题,代价 \(F(n-k)\)。
所以该选择的期望代价为\(1 + q^k\cdot F(n-k) + P(k)\cdot (G(k)+F(n-k))= 1 + F(n-k) + P(k)G(k).\)
因此,我们可以写出\(F(n)\)的状态转移方程为: \[ F(n)= \left\{\begin{aligned} &0 && \text{if}\quad n=0\\ &1 && \text{if}\quad n=1\\ &\min_{1\le k\le n}\{1+F(n-k)+P(k)G(k)\} && \text{else} \end{aligned}\right. \]
其中,第二行\(F(n)=1\)表示单检一次;对于第三行的\(k=n\) 对应“先把全部 \(n\) 只混在一起测一次”的策略,其期望为\(1 + P(n)G(n).\)
现在考虑 warm 状态:已知 \(n\) 只中至少一只感染(事件记为 \(E\))。
仍然先选 \(k\) 只做混检(\(1\le k\le n-1\)),记事件 \(A\) 为“这 \(k\) 只里至少一只感染”。
注意到 \(A\Rightarrow E\),因此
\[ \Pr(A\mid E)=\frac{\Pr(A)}{\Pr(E)} =\frac{P(k)}{P(n)} =\frac{1-q^k}{1-q^n}. \]
做完这一测(消耗 \(1\) 次检测)后:
- 若 \(A\) 发生(子组阳性),则子组是 warm:代价 \(G(k)\);剩下 \(n-k\) 只并没有任何“必有感染”的信息(感染可能都在子组里),因此剩余变成 cold:代价 \(F(n-k)\)。总计 \(G(k)+F(n-k)\)。
- 若 \(A\) 不发生(子组阴性),由于已知整体至少一只感染,感染只能在剩余 \(n-k\) 只里,因此剩余仍是 warm:代价 \(G(n-k)\)。
于是我们可以写出\(G(n)\)的状态转移方程:
\[ G(n)= \left\{\begin{aligned} &0 && \text{if}\quad n\le 1\\ &\min_{1\le k\le n-1}\left\{1+\frac{P(k)}{P(n)}\bigl(G(k)+F(n-k)\bigr)+\left(1-\frac{P(k)}{P(n)}\right)G(n-k)\right\} && \text{else} \end{aligned}\right. \]
至于,\(G(1)=0\),是因为已知这一只必感染,不用再测。
若朴素枚举所有 \(k\),计算到 \(n=10000\) 需要 \(O(n^2)\),对每个 \(p\) 都太慢。
关键观察:当 \(k\) 很大时,\(q^k\) 极小,混检几乎必阳性,那么“先混检一次”带来的主要收益(直接判掉一大组全阴性)几乎不可能发生;而一旦阳性,还得进入 \(G(k)\) 的递归判定,整体通常不划算。
当 \(k\) 不大时用指数近似:\(q^k=(1-p)^k \approx e^{-k(-\ln(1-p))}.\)把阴性概率设成某个常数 \(c\)(例如 \(q^k=c=e^{-2}\),那么解得\(k \approx \dfrac{2}{-\ln(1-p)}\)
这一量级附近。实现上可以取一个上界\(K=\lceil k\rceil +5\)
并且:
- 在 \(F(n)\) 中 始终额外把 \(k=n\) 作为候选,因为全体先混一次是唯一可能显著大于 \(K\) 仍有价值的特殊情况);
- 在 \(G(n)\) 中只需枚举 \(k\le \min(n-1,K)\)。
代码
1 |
|