下载MetaMap和UMLS,使用MetaMap找到句子中可能包含的所有实体以及实体的CUI,再根据实体CUI与UMLS中的关系三元组进行对齐,标注两个实体间关系.
下载连接 需要注册账号,账号审核大概3天左右。下载完成后进行解压。
- 进入解压后的目录,运行脚本,会打开如下图所示的GUI。
cd 2020AB-full
./run_linux.sh
- 选择
Install UMLS
,Destination一个新建好的空文件夹即可,下面的两个勾可以不进行勾选。
- 点击
OK
进入下一步,这里选择New Configuration
新建配置文件,主要是为了进行数据集的选择。
-
选择默认数据源,这里选择第一个就可以
-
点击
OK
来到下图所示界面,选择子数据集来源。先选择Select source to INCLUDE in subset
这一选项。之后在
Source to Include
列表中选择Diseases DataBase 2000 DDB00 DDB
这一来源 以及NCI
全体。多选按住ctrl
选择一个NCI来源的数据子集,就会提示你是否将NCI来源的全部数据加入子集。这里全部加入。
MetaMap NLM Common Data下载链接、MetaMap NLM Strict Model Data下载链接
-
将上面所需的文件都下载好后,首先解压MetaMap得到public_mm目录文件,之后解压JavaApi,讲解压后bin目录下的可执行文件移至public_mm的bin目录下,将src/javaapi/dist目录下的jar包解压出来,用于后面项目导入。之后将两个下载好的NLM数据集移至public_mm/DB目录下。
-
做完之后,进入public_mm目录下打开终端执行
./bin/install.sh
- 成功后开启metamap服务,下载NLM数据集是为了能够使MetaMap识别特定的医学实体。
./bin/skremdpostctl start
./bin/wsdserverctl start
./bin/mmserver -V NLM
在Idea中新建Maven项目
在pom.xml
加入fastjson
包,用于后面将数据处理为json格式。
<dependencies>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.75</version>
</dependency>
</dependencies>
之后在菜单栏点击File
选择Project Structure
在Libraries中添加MetaMap JavaAPI包
之后就可以利用MetaMap JavaApi来处理数据。可以在MetaMap官网中查看Api使用示例,以及文档
MainExtract.java
所提取的数据是从pubmed下载的医学文献摘要中提取出来的句子。
Despite linkage and association between PRAA-SA-LA and the canine DGCR, none of these mutations appeared responsible for PRAA-SA-LA. As the orthologue human region on HSA22q11.2 is known for high susceptibility to genomic rearrangements, we suspect that in German Pinschers, chromosomal aberrations might cause PRAA-SA-LA.
In a human-induced pluripotent stem cell-derived cardiomyocyte (hiPSC-CM) model of cLQTS2 containing the expression defective Kv11.1 mutant A422T, cardiac repolarization, estimated from the duration of calcium transients in isolated cells and the rate corrected field potential duration (FPDc) in culture monolayers of cells, was significantly prolonged.
According to the probably most realistic available calculations, overdiagnosis is acceptable as it is compensated by the potential mortality reduction.
These data suggest that circulating insulin has direct and indirect effects on the synthesis and secretion of GIP.
MainAligned.java
对齐所使用的数据是通过MainExtract.java
处理后的数据,格式如下:
{"sent":"We show that this feature stems from mutations in genes disturbing the capability of the cells to differentiate into a quiescent state, enabling them to divide under restrictive conditions.","entity":[{"pos":[26,31],"cui":"C1184743","name":["stem"],"type":["bpoc"]},{"pos":[26,31],"cui":"C1186763","name":["stems"],"type":["bpoc"]},{"pos":[50,55],"cui":"C0017337","name":["genes"],"type":["gngm"]},{"pos":[178,188],"cui":"C0012634","name":["condition"],"type":["dsyn"]}]}
{"sent":"African populations of Galba spp. are not yet studied using molecular assessments and is imperative to do so and reconstruct the centre of origin of Galba and to understand when and by what means it may have colonized the highlands of Africa and to what extent humans might have been involved in that process.","entity":[{"pos":[29,32],"cui":"C1424276","name":["spp"],"type":["gngm"]},{"pos":[301,308],"cui":"C1184743","name":["process"],"type":["bpoc"]}]}
最后得到的结果为
{"obj":{"pos":[101,104],"cui":"C2239176","name":["hcc"]},"subj":{"pos":[37,40],"cui":"C0079419","name":["p53"]},"sent":"The relationships between MKI67 with p53 and variants of candidate genes in the clinical outcomes of HCC patients were analyzed .","relation":"gene_associated_with_disease"}
{"obj":{"pos":[105,119],"cui":"C0040300","name":["normal","tissue"]},"subj":{"pos":[43,52],"cui":"C0009450","name":["infection"]},"sent":"Immunostaining of HDAC6 expression and HP infection were performed in the following cohort including 21 normal tissues ( Normal ) .","relation":"NA"}