Commit 06f4c9c

committed Jul 2, 2020
Adds the script to generate the malignancy annotation file
1 parent fb46b90 commit 06f4c9c

1 file changed, +433 -0 lines changed

@@ -0,0 +1,433 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Malignancy Annotations\n",
8+
"\n",
9+
"This notebook compiles the `annotations_with_malignancy.csv` file and drops annotations for CTs it cannot find.\n",
10+
"\n",
11+
"In addition to the usual suspects, you need the `pylidc` Python package (use `pip install pylidc` or [check out the source](https://pylidc.github.io/))."
12+
]
13+
},
14+
{
15+
"cell_type": "code",
16+
"execution_count": 1,
17+
"metadata": {},
18+
"outputs": [],
19+
"source": [
20+
"import torch\n",
21+
"import SimpleITK as sitk\n",
22+
"import pandas\n",
23+
"import glob, os\n",
24+
"import numpy\n",
25+
"import tqdm\n",
26+
"import pylidc\n"
27+
]
28+
},
29+
{
30+
"cell_type": "markdown",
31+
"metadata": {},
32+
"source": [
33+
"We first load the annotations from the LUNA challenge."
34+
]
35+
},
36+
{
37+
"cell_type": "code",
38+
"execution_count": 2,
39+
"metadata": {},
40+
"outputs": [],
41+
"source": [
42+
"annotations = pandas.read_csv('data/part2/luna/annotations.csv')"
43+
]
44+
},
45+
{
46+
"cell_type": "markdown",
47+
"metadata": {
48+
"scrolled": false
49+
},
50+
"source": [
51+
"For the CTs where we have a `.mhd` file, we collect the malignancy_data from PyLIDC.\n",
52+
"\n",
53+
"It is a bit tedious as we need to convert the pixel locations provided by PyLIDC to physical points.\n",
54+
"We will see some warnings about annotations being too close to each other (PyLIDC expects at most 4 annotations per site; see Chapter 14 for details, including when we consider a nodule to be malignant).\n",
55+
"\n",
56+
"This takes quite a while (~1-2 seconds per scan on one of the authors' computers)."
57+
]
58+
},
59+
{
60+
"cell_type": "code",
61+
"execution_count": 3,
62+
"metadata": {},
63+
"outputs": [
64+
{
65+
"name": "stderr",
66+
"output_type": "stream",
67+
"text": [
68+
" 11%|█▏ | 69/601 [01:52<13:05, 1.48s/it]"
69+
]
70+
},
71+
{
72+
"name": "stdout",
73+
"output_type": "stream",
74+
"text": [
75+
"Failed to reduce all groups to <= 4 Annotations.\n",
76+
"Some nodules may be close and must be grouped manually.\n"
77+
]
78+
},
79+
{
80+
"name": "stderr",
81+
"output_type": "stream",
82+
"text": [
83+
" 15%|█▌ | 93/601 [02:31<14:46, 1.75s/it]"
84+
]
85+
},
86+
{
87+
"name": "stdout",
88+
"output_type": "stream",
89+
"text": [
90+
"Failed to reduce all groups to <= 4 Annotations.\n",
91+
"Some nodules may be close and must be grouped manually.\n"
92+
]
93+
},
94+
{
95+
"name": "stderr",
96+
"output_type": "stream",
97+
"text": [
98+
" 18%|█▊ | 107/601 [02:53<14:35, 1.77s/it]"
99+
]
100+
},
101+
{
102+
"name": "stdout",
103+
"output_type": "stream",
104+
"text": [
105+
"Failed to reduce all groups to <= 4 Annotations.\n",
106+
"Some nodules may be close and must be grouped manually.\n"
107+
]
108+
},
109+
{
110+
"name": "stderr",
111+
"output_type": "stream",
112+
"text": [
113+
" 37%|███▋ | 225/601 [06:16<11:28, 1.83s/it]"
114+
]
115+
},
116+
{
117+
"name": "stdout",
118+
"output_type": "stream",
119+
"text": [
120+
"Failed to reduce all groups to <= 4 Annotations.\n",
121+
"Some nodules may be close and must be grouped manually.\n"
122+
]
123+
},
124+
{
125+
"name": "stderr",
126+
"output_type": "stream",
127+
"text": [
128+
" 44%|████▍ | 267/601 [07:24<07:51, 1.41s/it]"
129+
]
130+
},
131+
{
132+
"name": "stdout",
133+
"output_type": "stream",
134+
"text": [
135+
"Failed to reduce all groups to <= 4 Annotations.\n",
136+
"Some nodules may be close and must be grouped manually.\n"
137+
]
138+
},
139+
{
140+
"name": "stderr",
141+
"output_type": "stream",
142+
"text": [
143+
" 47%|████▋ | 281/601 [07:46<09:37, 1.80s/it]"
144+
]
145+
},
146+
{
147+
"name": "stdout",
148+
"output_type": "stream",
149+
"text": [
150+
"Failed to reduce all groups to <= 4 Annotations.\n",
151+
"Some nodules may be close and must be grouped manually.\n"
152+
]
153+
},
154+
{
155+
"name": "stderr",
156+
"output_type": "stream",
157+
"text": [
158+
" 61%|██████ | 368/601 [10:16<06:19, 1.63s/it]"
159+
]
160+
},
161+
{
162+
"name": "stdout",
163+
"output_type": "stream",
164+
"text": [
165+
"Failed to reduce all groups to <= 4 Annotations.\n",
166+
"Some nodules may be close and must be grouped manually.\n"
167+
]
168+
},
169+
{
170+
"name": "stderr",
171+
"output_type": "stream",
172+
"text": [
173+
" 72%|███████▏ | 434/601 [11:57<03:41, 1.32s/it]"
174+
]
175+
},
176+
{
177+
"name": "stdout",
178+
"output_type": "stream",
179+
"text": [
180+
"Failed to reduce all groups to <= 4 Annotations.\n",
181+
"Some nodules may be close and must be grouped manually.\n"
182+
]
183+
},
184+
{
185+
"name": "stderr",
186+
"output_type": "stream",
187+
"text": [
188+
" 74%|███████▍ | 446/601 [12:20<03:09, 1.22s/it]"
189+
]
190+
},
191+
{
192+
"name": "stdout",
193+
"output_type": "stream",
194+
"text": [
195+
"Failed to reduce all groups to <= 4 Annotations.\n",
196+
"Some nodules may be close and must be grouped manually.\n"
197+
]
198+
},
199+
{
200+
"name": "stderr",
201+
"output_type": "stream",
202+
"text": [
203+
" 75%|███████▍ | 450/601 [12:26<03:49, 1.52s/it]"
204+
]
205+
},
206+
{
207+
"name": "stdout",
208+
"output_type": "stream",
209+
"text": [
210+
"Failed to reduce all groups to <= 4 Annotations.\n",
211+
"Some nodules may be close and must be grouped manually.\n"
212+
]
213+
},
214+
{
215+
"name": "stderr",
216+
"output_type": "stream",
217+
"text": [
218+
" 88%|████████▊ | 527/601 [14:15<01:35, 1.29s/it]"
219+
]
220+
},
221+
{
222+
"name": "stdout",
223+
"output_type": "stream",
224+
"text": [
225+
"Failed to reduce all groups to <= 4 Annotations.\n",
226+
"Some nodules may be close and must be grouped manually.\n"
227+
]
228+
},
229+
{
230+
"name": "stderr",
231+
"output_type": "stream",
232+
"text": [
233+
" 96%|█████████▌| 577/601 [15:17<00:38, 1.59s/it]"
234+
]
235+
},
236+
{
237+
"name": "stdout",
238+
"output_type": "stream",
239+
"text": [
240+
"Failed to reduce all groups to <= 4 Annotations.\n",
241+
"Some nodules may be close and must be grouped manually.\n"
242+
]
243+
},
244+
{
245+
"name": "stderr",
246+
"output_type": "stream",
247+
"text": [
248+
" 99%|█████████▉| 597/601 [15:44<00:06, 1.66s/it]"
249+
]
250+
},
251+
{
252+
"name": "stdout",
253+
"output_type": "stream",
254+
"text": [
255+
"Failed to reduce all groups to <= 4 Annotations.\n",
256+
"Some nodules may be close and must be grouped manually.\n"
257+
]
258+
},
259+
{
260+
"name": "stderr",
261+
"output_type": "stream",
262+
"text": [
263+
"100%|██████████| 601/601 [15:48<00:00, 1.58s/it]\n"
264+
]
265+
}
266+
],
267+
"source": [
268+
"malignancy_data = []\n",
269+
"missing = []\n",
270+
"spacing_dict = {}\n",
271+
"scans = {s.series_instance_uid:s for s in pylidc.query(pylidc.Scan).all()}\n",
272+
"suids = annotations.seriesuid.unique()\n",
273+
"for suid in tqdm.tqdm(suids):\n",
274+
"    fn = glob.glob('./data-unversioned/part2/luna/subset*/{}.mhd'.format(suid))\n",
275+
"    if len(fn) == 0 or '*' in fn[0]:\n",
276+
"        missing.append(suid)\n",
277+
"        continue\n",
278+
"    fn = fn[0]\n",
279+
"    x = sitk.ReadImage(fn)\n",
280+
"    spacing_dict[suid] = x.GetSpacing()\n",
281+
"    s = scans[suid]\n",
282+
"    for ann_cluster in s.cluster_annotations():\n",
283+
"        # this is our malignancy criterion described in Chapter 14\n",
284+
"        is_malignant = len([a.malignancy for a in ann_cluster if a.malignancy >= 4])>=2\n",
285+
"        centroid = numpy.mean([a.centroid for a in ann_cluster], 0)\n",
286+
"        bbox = numpy.mean([a.bbox_matrix() for a in ann_cluster], 0).T\n",
287+
"        coord = x.TransformIndexToPhysicalPoint([int(numpy.round(i)) for i in centroid[[1, 0, 2]]])\n",
288+
"        bbox_low = x.TransformIndexToPhysicalPoint([int(numpy.round(i)) for i in bbox[0, [1, 0, 2]]])\n",
289+
"        bbox_high = x.TransformIndexToPhysicalPoint([int(numpy.round(i)) for i in bbox[1, [1, 0, 2]]])\n",
290+
"        malignancy_data.append((suid, coord[0], coord[1], coord[2], bbox_low[0], bbox_low[1], bbox_low[2], bbox_high[0], bbox_high[1], bbox_high[2], is_malignant, [a.malignancy for a in ann_cluster]))\n"
291+
]
292+
},
293+
{
294+
"cell_type": "markdown",
295+
"metadata": {},
296+
"source": [
297+
"You can check how many `.mhd` files you are missing. It seems that the LUNA data has dropped a couple (?). Don't worry if fewer than 10 are missing."
298+
]
299+
},
300+
{
301+
"cell_type": "code",
302+
"execution_count": 4,
303+
"metadata": {},
304+
"outputs": [
305+
{
306+
"name": "stdout",
307+
"output_type": "stream",
308+
"text": [
309+
"MISSING []\n"
310+
]
311+
}
312+
],
313+
"source": [
314+
"print(\"MISSING\", missing)"
315+
]
316+
},
317+
{
318+
"cell_type": "markdown",
319+
"metadata": {},
320+
"source": [
321+
"We stick the data we got from PyLIDC into a DataFrame."
322+
]
323+
},
324+
{
325+
"cell_type": "code",
326+
"execution_count": 5,
327+
"metadata": {},
328+
"outputs": [],
329+
"source": [
330+
"df_mal = pandas.DataFrame(malignancy_data, columns=['seriesuid', 'coordX', 'coordY', 'coordZ', 'bboxLowX', 'bboxLowY', 'bboxLowZ', 'bboxHighX', 'bboxHighY', 'bboxHighZ', 'mal_bool', 'mal_details'])"
331+
]
332+
},
333+
{
334+
"cell_type": "markdown",
335+
"metadata": {},
336+
"source": [
337+
"And now we match the malignancy data to the annotations. This is a lot faster..."
338+
]
339+
},
340+
{
341+
"cell_type": "code",
342+
"execution_count": 6,
343+
"metadata": {},
344+
"outputs": [
345+
{
346+
"name": "stderr",
347+
"output_type": "stream",
348+
"text": [
349+
"100%|██████████| 601/601 [00:01<00:00, 316.12it/s]\n"
350+
]
351+
}
352+
],
353+
"source": [
354+
"processed_annot = []\n",
355+
"annotations['mal_bool'] = float('nan')\n",
356+
"annotations['mal_details'] = [[] for _ in annotations.iterrows()]\n",
357+
"bbox_keys = ['bboxLowX', 'bboxLowY', 'bboxLowZ', 'bboxHighX', 'bboxHighY', 'bboxHighZ']\n",
358+
"for k in bbox_keys:\n",
359+
"    annotations[k] = float('nan')\n",
360+
"for series_id in tqdm.tqdm(annotations.seriesuid.unique()):\n",
361+
"    # series_id = '1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860'\n",
362+
"    # c = candidates[candidates.seriesuid == series_id]\n",
363+
"    a = annotations[annotations.seriesuid == series_id]\n",
364+
"    m = df_mal[df_mal.seriesuid == series_id]\n",
365+
"    if len(m) > 0:\n",
366+
"        m_ctrs = m[['coordX', 'coordY', 'coordZ']].values\n",
367+
"        a_ctrs = a[['coordX', 'coordY', 'coordZ']].values\n",
368+
"        #print(m_ctrs.shape, a_ctrs.shape)\n",
369+
"        matches = (numpy.linalg.norm(a_ctrs[:, None] - m_ctrs[None], ord=2, axis=-1) / a.diameter_mm.values[:, None] < 0.5)\n",
370+
"        has_match = matches.max(-1)\n",
371+
"        match_idx = matches.argmax(-1)[has_match]\n",
372+
"        a_matched = a[has_match].copy()\n",
373+
"        # c_matched['diameter_mm'] = a.diameter_mm.values[match_idx]\n",
374+
"        a_matched['mal_bool'] = m.mal_bool.values[match_idx]\n",
375+
"        a_matched['mal_details'] = m.mal_details.values[match_idx]\n",
376+
"        for k in bbox_keys:\n",
377+
"            a_matched[k] = m[k].values[match_idx]\n",
378+
"        processed_annot.append(a_matched)\n",
379+
"        processed_annot.append(a[~has_match])\n",
380+
"    else:\n",
381+
"        processed_annot.append(a)\n",
382+
"processed_annot = pandas.concat(processed_annot)\n",
383+
"processed_annot.sort_values('mal_bool', ascending=False, inplace=True)\n",
384+
"processed_annot['len_mal_details'] = processed_annot.mal_details.apply(len)"
385+
]
386+
},
387+
{
388+
"cell_type": "markdown",
389+
"metadata": {},
390+
"source": [
391+
"Finally, we drop rows with NAs (where we didn't find a match) and save the result in the right place."
392+
]
393+
},
394+
{
395+
"cell_type": "code",
396+
"execution_count": 7,
397+
"metadata": {},
398+
"outputs": [],
399+
"source": [
400+
"df_nona = processed_annot.dropna()\n",
401+
"df_nona.to_csv('./data/part2/luna/annotations_with_malignancy.csv', index=False)"
402+
]
403+
},
404+
{
405+
"cell_type": "code",
406+
"execution_count": null,
407+
"metadata": {},
408+
"outputs": [],
409+
"source": []
410+
}
411+
],
412+
"metadata": {
413+
"kernelspec": {
414+
"display_name": "Python 3",
415+
"language": "python",
416+
"name": "python3"
417+
},
418+
"language_info": {
419+
"codemirror_mode": {
420+
"name": "ipython",
421+
"version": 3
422+
},
423+
"file_extension": ".py",
424+
"mimetype": "text/x-python",
425+
"name": "python",
426+
"nbconvert_exporter": "python",
427+
"pygments_lexer": "ipython3",
428+
"version": "3.8.3"
429+
}
430+
},
431+
"nbformat": 4,
432+
"nbformat_minor": 2
433+
}
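
The two rules this notebook relies on can be tried out on toy data without PyLIDC or the LUNA files. The following is a minimal sketch, not the notebook's code: `is_malignant` and `match_centers` are hypothetical helper names introduced here, and the coordinates and diameters are made up. It assumes, as in the cells of the diff, that a nodule counts as malignant when at least two annotators rate it 4 or higher, and that an annotation matches a PyLIDC cluster when the distance between their centers is less than half the annotation's diameter.

```python
import numpy as np


def is_malignant(ratings, threshold=4, min_votes=2):
    """Return True if at least `min_votes` ratings are >= `threshold`."""
    return sum(r >= threshold for r in ratings) >= min_votes


def match_centers(a_ctrs, a_diams, m_ctrs, max_ratio=0.5):
    """Match annotation centers to cluster centers.

    A pair matches when the Euclidean distance between the two centers,
    divided by the annotation diameter, is below `max_ratio`.
    Returns (has_match, match_idx); match_idx is only meaningful for
    rows where has_match is True.
    """
    a_ctrs = np.asarray(a_ctrs, dtype=float)    # shape (A, 3)
    m_ctrs = np.asarray(m_ctrs, dtype=float)    # shape (M, 3)
    a_diams = np.asarray(a_diams, dtype=float)  # shape (A,)
    # Broadcasting (A, 1, 3) against (1, M, 3) yields all pairwise differences.
    dists = np.linalg.norm(a_ctrs[:, None] - m_ctrs[None], axis=-1)  # (A, M)
    matches = dists / a_diams[:, None] < max_ratio
    has_match = matches.any(axis=-1)
    match_idx = matches.argmax(axis=-1)  # index of the first matching cluster
    return has_match, match_idx


# Toy data (made up): two annotations, two cluster centers.
a_ctrs = [[0.0, 0.0, 0.0], [50.0, 50.0, 50.0]]
a_diams = [10.0, 4.0]
m_ctrs = [[40.0, 40.0, 40.0], [1.0, 2.0, 2.0]]

has_match, match_idx = match_centers(a_ctrs, a_diams, m_ctrs)
print(has_match, match_idx[has_match])
print(is_malignant([4, 5, 2, 1]), is_malignant([4, 3, 3, 3]))
```

Rows without any match still get index 0 from `argmax`, which is why the indices are filtered with `has_match` before being used to look up cluster data.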
