Recent Academic Research on Next Generation Sequencing

In this work, we designed a program that lets our cloud system search the NCBI PubMed database, periodically and fully automatically, for academic papers in the required fields. Segmentation (tokenization) and keyword-matching algorithms then determine the precise research classification of each retrieved paper, and the results are compiled and stored in the cloud system.
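
The segmentation and keyword-matching step can be pictured with a minimal Python sketch. This is not the system's actual code: the keyword list, the function names, and the tokenization rule are all assumptions made for illustration, since the real judgment words are not listed here.

```python
import re

# Hypothetical keyword list; the system's real judgment words are not published.
NGS_KEYWORDS = (
    "next generation sequencing",
    "next-generation sequencing",
    "ngs",
)


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping inner hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def matches_topic(text, keywords=NGS_KEYWORDS):
    """Return True if any keyword appears as a contiguous run of tokens."""
    tokens = tokenize(text)
    for keyword in keywords:
        kw_tokens = tokenize(keyword)
        n = len(kw_tokens)
        # Slide a window of n tokens across the text and compare token by token.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == kw_tokens:
                return True
    return False
```

Comparing whole tokens rather than raw substrings keeps a short keyword such as "ngs" from matching inside an unrelated word such as "kings".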

Using the data processing flow shown in the figure above, we can determine the source of each paper, the type of data, and the people or groups who need it, and then provide the required information. The latest publications are delivered to subscribers of this service in the most suitable display format. As an example, we took "next generation sequencing", a molecular biology research method that has become popular in recent years, and used the data obtained in April 2018 to illustrate the effectiveness of this system and to give a simple assessment of the research content in this field.

The current system queries NCBI PubMed automatically six times per day to find papers and records the search results, then classifies papers into this field according to the judgment terms for "Next Generation Sequencing". After obtaining the classified papers, we took the words in the titles and content of the papers from April 2018 and analyzed their word frequency with the TF-IDF algorithm. With the word-frequency values in hand, we can draw the corresponding word cloud, shown in the figure above.
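
To make the word-frequency step concrete, here is a minimal pure-Python sketch of one common TF-IDF variant (term frequency as count divided by document length, inverse document frequency as the natural log of N / df). The exact variant, tokenizer, and aggregation the system uses are not stated, so the function names and the choice to sum scores across papers are assumptions. The name/weight dictionaries are likewise an assumption, chosen to resemble the point format that Highcharts word cloud series accept.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf = term count / document length, idf = ln(N / document frequency).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def word_cloud_weights(documents, top_n=50):
    """Sum TF-IDF scores over all documents and return the top_n terms."""
    totals = Counter()
    for doc_scores in tf_idf(documents):
        totals.update(doc_scores)  # adds the float scores term by term
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"name": term, "weight": weight} for term, weight in ranked[:top_n]]
```

With this variant, a term that occurs in every paper gets a score of zero, so words shared by the whole collection drop out of the cloud while more distinctive terms stand out.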

In the results for papers published in April, some terms occur more frequently than others, such as gene mutations, gene expression, cancer, tumor, carcinoma, genome, genetics, clinical, disease, treatment, protein expression, and species identification. These terms reflect the research areas in which "Next Generation Sequencing" methods have recently been widely adopted, or the research topics that scientists are trying to explore with this technology.

For data mining and marketing applications, we can apply the analysis above with the analysis result as one data dimension and time as another. This shows how the publication of research papers related to "Next Generation Sequencing" trends over time. If corresponding technology promotion activities (marketing) took place earlier, the research dynamics can serve as a basis for judging customer acceptance or technical competitiveness. If no such promotion exists, the research dynamics instead indicate the application areas in which scientists adopt the technology on their own initiative.
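
The two-dimensional view described above (terms on one axis, time on the other) could be built along the following lines. This is only a sketch under assumptions: the record layout (publication date plus title), the monthly bucket size, and the simple substring match are illustrative choices, not the system's documented behavior.

```python
from collections import Counter, defaultdict
from datetime import date


def monthly_term_counts(papers, terms):
    """Count, per month, how many paper titles mention each term.

    papers: iterable of (publication_date, title) pairs (hypothetical layout).
    Returns {"YYYY-MM": Counter({term: number_of_papers})}.
    """
    table = defaultdict(Counter)
    for published, title in papers:
        month = published.strftime("%Y-%m")
        lowered = title.lower()
        for term in terms:
            if term in lowered:  # simple case-insensitive substring match
                table[month][term] += 1
    return dict(table)
```

Calling it with pairs such as `(date(2018, 4, 2), "Tumor mutation burden")` and the terms `["mutation", "cancer"]` yields one counter per month; reading a single term across months gives its publication trend over time.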

For scientists, this service speeds up the acquisition of professional knowledge; for technology providers and publishers, it improves the precision with which technologies and publications are promoted. If the data collected by this system can be further cross-analyzed with a service provider's internal data, there is an opportunity to understand the relationship between specific marketing work, customers' published academic results, and the internal data. If you need data analysis or a compilation and notification service of this kind, please contact us.

Notes:

  1. Explanation of the TF-IDF word frequency algorithm on Wikipedia: https://zh.wikipedia.org/wiki/Tf-idf
  2. The Highcharts charting library used in this analysis: https://www.highcharts.com/demo