Querying cancer-related genes with a two-layer artificial neural network: an example

In this analysis, we used the same two-layer neural network algorithm and the same volume of paper data as in the previous report. We queried the conclusions of about 470,000 medical journal articles from the past month to find out which cancer types and cancer-related genes they mention. The inputs were two terms that have been quite popular recently: "lung cancer" and "EGFR" (epidermal growth factor receptor). Within a few seconds of submitting the two terms, the network returned a series of related terms, and a two-dimensional projection was generated from the similarity of their word vectors, as shown in the figure above.
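The query-and-project step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vectors in `toy_vectors` are made-up 4-dimensional values rather than output of the trained network, cosine similarity is assumed as the similarity measure, and PCA is assumed as the projection method (the report does not say which one was used). The function names `most_similar` and `project_to_2d` are hypothetical.

```python
import numpy as np

# Hypothetical 4-dimensional toy vectors (illustration only); a trained
# model would learn vectors with many more dimensions from the paper text.
toy_vectors = {
    "lung cancer": np.array([0.9, 0.1, 0.0, 0.2]),
    "NSCLC": np.array([0.8, 0.2, 0.1, 0.3]),
    "melanoma": np.array([0.6, 0.1, 0.5, 0.1]),
    "EGFR": np.array([0.1, 0.9, 0.2, 0.0]),
    "KRAS": np.array([0.2, 0.8, 0.3, 0.1]),
    "HER2": np.array([0.0, 0.7, 0.5, 0.2]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def most_similar(query, vectors, top_n=3):
    """Return the top_n terms closest to `query`, best match first."""
    query_vec = vectors[query]
    scored = [
        (term, cosine_similarity(query_vec, vec))
        for term, vec in vectors.items()
        if term != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def project_to_2d(vectors):
    """Project all vectors onto their first two principal components (PCA)."""
    terms = list(vectors)
    matrix = np.array([vectors[t] for t in terms])
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    return dict(zip(terms, coords))


if __name__ == "__main__":
    for query in ("lung cancer", "EGFR"):
        neighbours = [t for t, _ in most_similar(query, toy_vectors, top_n=2)]
        print(query, "->", neighbours)
    for term, (x, y) in project_to_2d(toy_vectors).items():
        print(f"{term:12s} {x:+.3f} {y:+.3f}")
```

With these toy values, the cancer terms rank closest to "lung cancer" and the gene terms rank closest to "EGFR". Because the 2D coordinates come from an orthogonal projection, the distance between two projected points is never larger than their distance in the original space and can be much smaller, so closeness on the plot is only a rough guide.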

Because the data include only the conclusions of recent medical research, the positions of terms in the two-dimensional projection and the distances between them do not necessarily represent correlation or similarity. Even so, the vocabulary the neural network extracted from the research papers is quite accurate and meaningful. The results, organized by cancer name and gene name, are listed below:

Name of cancers

Input value: lung cancer

Output values: NSCLC, HNSCC, lung adenocarcinoma, ovarian cancer, TNBC, SCLC, RCC, pancreatic cancer, lung cancers, melanoma, CRC, bladder cancer, glioblastoma, gastric cancer, PDAC, ESCC, colorectal cancer, colon cancer, GBM, osteosarcoma, breast cancer, PCa, ccRCC, thyroid cancer, EOC, PTC, prostate cancer, breast cancers, AML, ovarian carcinoma, OSCC, glioma, esophageal cancer, GIST, CRPC, NPC, cervical cancer, MPM, colorectal carcinoma, non-small cell, NSCLCs

Name of cancer-related genes

Input value: EGFR

Output values: HER2, KRAS, ALK, BRAF, PD-L1, MET, p53, EGFR mutations, FGFR1

In terms of term structure, the extracted cancer names are quite accurate, apart from one truncated term ("non-small cell" should be "non-small cell lung cancer"). For cancer-related genes, the query correctly returned eight other important gene names. Among them, HER2 is highly associated with breast cancer; KRAS, PD-L1, MET, p53, FGFR1, and BRAF are highly associated with various cancers; and EGFR and ALK are associated with some types of lung cancer and with other cancers. In addition, the neural network places the phrase "EGFR mutations" close to the term "EGFR" itself, which suggests that the two terms carry similar weight in medical conclusions.

Taken together, these results show that, under these experimental conditions, the neural network can successfully extract important research vocabulary from the papers. If the input and output range of the data is extended, the analysis can be made more comprehensive; if the query is a less popular term from outside the cancer field, the neural network can still return correct results. This is also where artificial intelligence is most successful compared with human thinking. Finally, please contact us if you have any needs related to large-scale data analysis.