前一陣我們翻譯Rob Knight的綜述，1.8萬字，讓你熟讀2遍輕鬆握掌微生物組領域分析框架、把握未來分析趨勢。目前在宏基因組平台累計1.9萬人次，熱心腸平台首發閱讀8500+，科學網加精置頂閱讀8700+，CSDN閱讀1200+，四大平台閱讀人數近4萬次。

還沒有仔細學習的，趕快讀兩遍吧！

Nature綜述：大佬手把手教你分析菌群數據——1.8萬字

其中提到了一款追蹤微生物來源的軟體SourceTracker，被多位朋友留言，今天為大家分享一下軟體的結果解讀。同時詳細說明輸入文件的準備和代碼的注釋，方便大家上手應用於自己的數據。

SourceTracker有什麼用？

用途是可以識別相關各組間來源的分析，如具體的問題：

嬰兒的腸道菌群有哪些繼承了母親的腸道菌群、哪些來自陰道菌群、哪些來自皮膚
法醫學的應用，屍體中的菌群與來源土壤的鑑定、腐敗菌來自本身，還是周圍環境
河流污染物的來源分析、周圍工廠、農田、養殖廠對河流污染的貢獻和來源追溯。
分析植物菌組形成過程：植物根際菌在土壤中來源和種子來源；葉際菌群的土壤來源比例等。

軟體簡介

Bayesian community-wide culture-independent microbial source tracking 於2011年發表於Nature Methods

由聖地亞哥大學的Scott教授及Rob Knight團隊合作完成。第一作者Dan Knights。

Google統計目前引用299次。

該軟體中目標樣本為Sink，微生物污染源或來源的樣品為Source；基於貝葉斯算法，探究目標樣本（Sink）中微生物污染源或來源（Source）的分析。根據Source樣本和Sink樣本的群落結構分布，來預測Sink樣本中來源於各Source樣本的組成比例。

我們之前解讀過Rob Knight的一篇Sciences文章中圖2A就使用此軟體分析確定屍體腐敗過程中主要菌來自於土壤的結果。

16S+功能預測發Sciences

軟體結果解讀

SourceTracker分析圖a，預測樣本來源比例柱狀圖。一幅圖代表一個預測樣本，用不同顏色的柱子表示該樣本中各來源的比例，Unknow代表未知來源分類，誤差線代表100次Gibbs採樣的標準差。

SourceTracker分析圖b，預測樣本來源比例面積圖。一幅圖代表一個預測樣本，不同顏色代表不同來源的比例，每一列代表一次Gibbs採樣結果，100次Gibbs採樣結果按照相近的排列順序進行展示。

SourceTracker分析圖c，預測樣本來源比例餅圖。一個餅圖代表一個預測樣本，不同顏色扇形的比例代表該預測樣本中各來源的比例。

原文解讀

DOI: 10.1038/nmeth.1650

圖1 SourceTracker和其他模型的比較。所示模型估計模擬樣本中兩個源環境的比例，Jensen-Shannon差異表示環境之間的重疊程度從0（完全相同，因此不可能消除歧義）到1（完全不重疊，因此容易區分）。繪製了估計比例的確定係數（r2）。每個點代表100個樣本的三次試驗的平均R2；誤差條顯示均值的標準誤s.e.m.（n=3）。

Figure 1 | Comparison of SourceTracker and other models. Indicated models estimate the proportions of two source environments in simulated samples, as the degree of overlap between the environments was varied from a Jensen-Shannon divergence of 0 (completely identical and thus impossible to disambiguate) to 1 (completely non-overlapping and thus trivial to disambiguate). The coefficients of determination (R2) of the estimated proportions are plotted. Each point represents the mean R2 for three trials of 100 samples each; error bars show s.e.m. (n = 3).

Nature Method [1]原文：每個圖形代表一個樣本Sink，分別是Lab1的PCR水、NICU桌子、辦公室電話；不同顏色表示不同樣本來源Source，所占面積為在Sink樣本中各來源的比例。

圖2. SourceTracker對洗碗池樣本子集的比例估計。（a–c）使用SourceTracker估計的三個洗碗池樣本的源環境比例，每個源環境中包括45個訓練樣本：吉布斯採樣中100次取樣的平均比例（a），相同樣本的數據，包括比例估計的標準變異S.D.（b）。100次吉布斯繪製的可視化；每個列顯示一次採樣的結果，列按保持相似的混合物進行排列在一起，使圖形看起來看美觀、更容易觀察和比較規律（c）。

Figure 2 | SourceTracker proportion estimates for a subset of sink samples. (a–c) Source environment proportions for three sink samples estimated using SourceTracker and 45 training samples from each source environment: mean proportions for 100 draws from Gibbs sampling (a), data for the same samples, including s.d. of the proportion estimates (b), and visualization of the 100 Gibbs draws; each column shows the mixture from one draw, with columns ordered to keep similar mixtures together (c).

圖3 常見污染操作分類單元（OTU）的相對豐度。SourceTracker可以為水槽樣本中的每個OTU觀測序列分配不同的源環境。這十個OTU源對在水槽環境中具有最高的平均相對豐度，不包括未知源。圖例給出了OTU的屬級分類、OTU標識符和分配給這些觀測的源環境。值得注意的是，被歸類為腸桿菌的OTU，一種常見於腸道的譜系，在皮膚訓練樣本中比在腸道訓練樣本中更為普遍。

Figure 3 | Relative abundance of common contaminating operational taxonomic units (OTUs). SourceTracker may assign a different source environment to each observation (sequence) of an OTU in the sink samples. These ten OTU-source pairs had the highest average relative abundance across sink environments, excluding the unknown source. The legend gives the genus-level taxonomic classification of the OTU, the OTU identifier and the source environment assigned to these observations of the OTU. Note that the OTU classified as Enterobacter, a lineage commonly seen in the gut, was more prevalent in the skin training samples than the gut training samples.

文章實戰解讀

Rob Knight-2016-Sciences[2]文章中圖(A) 動態貝葉斯推理網絡: 屍體分解過程微生物分類群神經信息流動網絡，土壤是其主要來源。小鼠屍體四種取樣位置分別為頭、軀幹、腹部和土壤。顏色為3種環境，分別為沙漠、草地和森林，且均與土壤來源微生物非常顯著相關。

更多內容，請閱讀下文：

16S+功能預測發Sciences：屍體降解過程

[3]產道菌群移植對剖腹產嬰兒缺失菌群的恢復：圖中展示了嬰兒三個部位皮膚Skin、口腔Oral、肛門Anal中，腸道菌群組成的來源，隨時間的推移1-30天而發生的改變。

剖腹產的嬰兒患免疫和代謝疾病的風險增高，被認為可能由於缺乏了與母親生殖道分泌物（包括微生物）的接觸；母親生殖道的分泌物會覆蓋順產嬰兒的全身，促進了嬰兒口腔、腸道、皮膚菌群的定殖，以及對嬰兒的保護作用；對剖腹產嬰兒塗抹分泌物，隨時間推移，其各部位菌群特徵逐漸趨向於順產嬰兒。該方法可以部分的恢復剖腹產嬰兒菌群，但對健康的長期影響有待觀察，以及樣本量也需要擴大。

軟體安裝

SourceTracker是一個R腳本，最新版本地址：https://github.com/danknights/sourcetracker ，版本1.0，2016年9月18日更新

# 下載腳本和測試數據
git clone git@github.com:danknights/sourcetracker.git
cd sourcetracker/

example.r使用data目錄中的OTU表和mapping實現文章中的分析實例

# 運行測試數據
Rscript example.r

運行中顯示如下結果:

Rarefying training data at 1000
Gut Oral Skin Soil Unknown
.......... 1 of 125, depth= 12: 0.12 (0.09) 0.07 (0.03) 0.11 (0.06) 0.02 (0.04) 0.68 (0.09)
.......... 2 of 125, depth= 20: 0.22 (0.03) 0.08 (0.03) 0.29 (0.05) 0.03 (0.04) 0.39 (0.05)
.......... 3 of 125, depth= 10: 0.60 (0.00) 0.00 (0.00) 0.25 (0.13) 0.00 (0.00) 0.15 (0.13)
.......... 4 of 125, depth= 10: 0.28 (0.06) 0.09 (0.03) 0.03 (0.05) 0.01 (0.03) 0.59 (0.07)
.......... 5 of 125, depth= 2: 0.00 (0.00) 0.00 (0.00) 0.45 (0.16) 0.00 (0.00) 0.55 (0.16)

先將訓練樣本抽平至1000條。再進行125次的重採樣，來源包括腸、口腔、皮膚、土壤和未知五大類來源。結果為以上講解的幾種圖形，建議在Rstudio中打開R腳本繪圖，交互探索結果。

輸入文件準備

按照data/metadata.txt標準的實驗設計和data/otus.txt標準的OTU表，在example.r中修改對應的分組，即可分析自己的數據了，非常easy。最好按照示例的文件填寫內容，減少代碼修改直接運行，以下有實驗設計信息說明：

實驗設計

示例文件：data/metadata.txt

主要包括的列有：樣品名、描述、環境Env、來源SourceSink、研究、細節

你的實驗設計必須有前4列

#SampleID列：樣品編號，以英文字母開頭，最好只包括字母和數字，必須與OTU表行名一致
Description列：樣品名，即樣本名稱，可以包括空格，非字符，展示為圖為標籤方便閱讀理解
Env列：為樣品來源注釋列，包括上面輸出的Gut、Oral、Skin、Soil，主要為採樣來源
SourceSink列，要計算的樣品標Sink，而來源數據標Source

#SampleID Description Env SourceSink Study Details
Run20100430_H2O-1 PCR water 1 Lab 1 sink Lab 1 PCR_water_sample_1_2010_04_30_run
Run20100430_H2O-2 PCR water 2 Lab 1 sink Lab 1 PCR_water_sample_2_2010_04_30_run
Run20100430_H2O-3 PCR water 3 Lab 1 sink Lab 1 PCR_water_sample_3_2010_04_30_run
BB2 Spodosol 1 Soil source 88_Soils NA
IT2 Spodosol 2 Soil source 88_Soils NA
CL3 Ultisol Soil source 88_Soils NA

OTU表

示例文件：data/otus.txt

QIIME導出biom後的經典表格格式：行為OTU編號，列為樣本名，矩陣對應為原始測序數量

# QIIME-formatted OTU table
#OTU ID Run20100430_ESC_C-1ss Run20100430_ESC_C-2ss Run20100430_ESC_C-3ss Run20100430_ESC_C-4ss Run20100430_ESC_C-5ss
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

代碼中文解讀

# This runs SourceTracker on the original "contamination" data set
# (data included in 'data' folder)

# load sample metadata
metadata <- read.table('data/metadata.txt',sep='\t',h=T,row.names=1,check=F,comment='')

# load OTU table
# This 'read.table' command is designed for a
# QIIME-formatted OTU table.
# namely, the first line begins with a '#' sign
# and actually _is_ a comment; the second line
# begins with a '#' sign but is actually the header
# 讀取行、列名，跳過一行，注釋為空即讀#號行
otus <- read.table('data/otus.txt',sep='\t', header=T,row.names=1,check=F,skip=1,comment='')
# 讀入數據框變矩陣，且轉置為樣本為行的舊格式
otus <- t(as.matrix(otus))

# extract only those samples in common between the two tables
common.sample.ids <- intersect(rownames(metadata), rownames(otus))
otus <- otus[common.sample.ids,]
metadata <- metadata[common.sample.ids,]
# double-check that the mapping file and otu table
# had overlapping samples
# 判斷是否存在共有樣品，否則退出
if(length(common.sample.ids) <= 3) {
message <- paste(sprintf('Error: there are %d sample ids in common ',length(common.sample.ids)),
'between the metadata file and data table')
stop(message)
}

# extract the source environments and source/sink indices
# 篩選哪些是來源或目標真假T/F，which轉化為位置編號
# 共篩選訓練集180個，測試集125個
train.ix <- which(metadata$SourceSink=='source')
test.ix <- which(metadata$SourceSink=='sink')
# 測試集太多，只保留6個樣品做演示
test.ix = head(test.ix)
envs <- metadata$Env
# 判斷是否存在Description列，存在列保存於desc
if(is.element('Description',colnames(metadata))) desc <- metadata$Description

# load SourceTracker package
# 加載軟體包
source('src/SourceTracker.r')

# tune the alpha values using cross-validation (this is slow!)
# 使用交叉驗證調整alpha值，非常耗時
# tune.results <- tune.st(otus[train.ix,], envs[train.ix])
# alpha1 <- tune.results$best.alpha1
# alpha2 <- tune.results$best.alpha2
# note: to skip tuning, run this instead:
# 跳過優化alpha值步驟，直接設置為0.001繼續計算
alpha1 <- alpha2 <- 0.001

# train SourceTracker object on training data
# 基於訓練集和對應描述獲得預測模型
st <- sourcetracker(otus[train.ix,], envs[train.ix])

# Estimate source proportions in test data
# 估計測試集中來源比例
results <- predict(st,otus[test.ix,], alpha1=alpha1, alpha2=alpha2)

# Estimate leave-one-out source proportions in training data
# 在訓練集中留一法（一種交叉驗證方法）估計來源比例，計算次數等於訓練集樣本數量，極耗時
# results.train <- predict(st, alpha1=alpha1, alpha2=alpha2)

# plot results
# 結果繪圖, 將環境和描述列合併作為標籤展示
labels <- sprintf('%s %s', envs,desc)
# 繪製餅形圖比例
plot(results, labels[test.ix], type='pie')

# other plotting functions
plot(results, labels[test.ix], type='bar')
plot(results, labels[test.ix], type='dist')
# plot(results.train, labels[train.ix], type='pie')
# plot(results.train, labels[train.ix], type='bar')
# plot(results.train, labels[train.ix], type='dist')

# plot results with legend
# 添加圖例，並人工指定顏色
plot(results, labels[test.ix], type='pie', include.legend=TRUE, env.colors=c('#47697E','#5B7444','#CC6666','#79BEDB','#885588'))
plot(results, labels[test.ix], type='pie', include.legend=TRUE, env.colors=rainbow(5))

最後結果繪圖，可選餅形圖(pie)、柱狀圖(bar)和堆疊圖(dist)，如上面示例所示。

Knights D, Kuczynski J, Charlson ES, et al. Bayesian community-wide culture-independent microbial source tracking [J]. Nature Methods, 2011, 8(9): 761. DOI: 10.1038/nmeth.1650
Metcalf, J. L., et al. (2016). 「Microbial community assembly and metabolic function during mammalian corpse decomposition.」 Science 351(6269): 158-162.
Dominguez-Bello MG, De JKM, Nan S, et al. Partial restoration of the microbiota of cesarean-born infants via vaginal microbial transfer [J]. Nature Medicine, 2016, 22(3): 250-253.
https://www.nature.com/articles/nmeth.1650
https://www.ncbi.nlm.nih.gov/pubmed/21765408
銳翌16S分析升級之③ SourceTracker — 尋找微生物的來源 https://mp.weixin.qq.com/s/eAD42C8ZZAcHBXWmr6HnUw

Nature Methods：微生物來源分析包SourceTracker——結果解讀和使用教程