Search and early warning of major events, taking oncology news as an example (重大事件搜索與預警,以癌症醫學新聞為例)

In this case study, we used specific keywords to select oncology-related articles from online news and community discussions. Based on the characteristics of the media sources, the data is divided into four categories, as shown in the chart above. The black line, which has the highest values, indicates that oncology-related news comes mainly from more formal media reports or blog posts. The X-axis of the chart shows the dates of the last 30 days, and the Y-axis shows the number of articles that passed the keyword filter on each day. (The total number of articles in the last 30 days is 235,235.)
在這次的案例分析中,我們以特定的關鍵詞彙篩選網路媒體新聞與社群討論中,與癌症醫學相關的文章。並將資料依據媒體來源的特性,區分成四大類如上圖。其中數值較高的黑線,代表癌症醫學的相關新聞資訊主要來自於較正式的媒體報導或部落格文章。統計圖片中的X軸表示最近30天的日期,Y軸表示當天通過關鍵詞彙篩選的文章數量。(最近30天的總文章數量為23萬5235篇)

The chart above clearly shows three peaks in news reporting, two in April and one in May. Because each peak is short-lived, we use a 72-hour time window to examine the content characteristics of each one.
從上面的視覺圖表中,可以很容易看到三個新聞報導的高峰,其中兩次出現於四月,一次出現於五月。依據報導高峰的時間特性,我們以72小時為時間視窗,觀測報導高峰中的內容特徵。
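
The window selection described above can be sketched as follows. This is only an illustration, assuming the data is available as one article count per day; the function name `top_windows`, the greedy non-overlapping selection, and the sample numbers are our own assumptions for the example and are not the chart's actual data or the production method.

```python
def top_windows(daily_counts, window_days=3, n_windows=3):
    """Return the highest-volume, non-overlapping windows of window_days days.

    daily_counts: list of (day_label, count) pairs, one per day, in date order.
    Returns (first_day, last_day, total) tuples in date order.
    """
    counts = [count for _, count in daily_counts]
    # Total article count for every run of window_days consecutive days.
    totals = [
        (sum(counts[i:i + window_days]), i)
        for i in range(len(counts) - window_days + 1)
    ]
    chosen = []
    # Take the largest windows first, skipping any that overlap a chosen one.
    for _, start in sorted(totals, key=lambda t: (-t[0], t[1])):
        if all(abs(start - s) >= window_days for s in chosen):
            chosen.append(start)
        if len(chosen) == n_windows:
            break
    return [
        (daily_counts[s][0],
         daily_counts[s + window_days - 1][0],
         sum(counts[s:s + window_days]))
        for s in sorted(chosen)
    ]


# Made-up counts for 20 consecutive days (illustration only).
example_counts = [5, 6, 5, 40, 50, 45, 6, 5, 7, 30,
                  35, 28, 5, 6, 5, 6, 60, 70, 65, 5]
example_windows = top_windows(list(enumerate(example_counts)))
for first_day, last_day, total in example_windows:
    print(f"days {first_day}-{last_day}: {total} articles")
```

A three-day window corresponds to the 72 hours used in this analysis; `window_days` can be changed if peaks in another topic last longer.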

The word frequencies generated from the first time window show that, in the 72 hours from April 11 to April 13, reports on specific cancers were the main feature of the content, followed by lifestyle factors such as food and the environment. Looking further into the causes of the high word frequencies, we find that this period saw intensive coverage of the following topics: cancer and the surgical removal of the affected organs; the treatment of psoriasis and skin cancer; analgesics and gastric cancer; constipation, diarrhea, and colorectal cancer; diet and cancer initiation or prevention; traditional Chinese medicine and novel Western medicine in cancer treatment; air pollution and lung cancer; and anti-cancer drugs and import tariffs.
第一個時間視窗區間所產生的詞頻內容,顯示在4月11日至4月13日這72小時中,特定癌症的報導是內容的主要特徵,其次則為食物與環境等生活因素。倘若進一步探究高詞頻產生的原因,則可發現這個時間區間內密集報導癌症與對應器官切除手術的相關新聞、治療乾癬與皮膚癌的相關新聞、止痛藥與胃癌的相關新聞、便秘腹瀉與大腸癌的相關新聞、飲食內容與癌症的相關新聞、中西醫與癌症治療的相關新聞、空氣汙染與肺癌的相關新聞、抗癌藥物與進口關稅的相關新聞。
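
A word-frequency count restricted to one time window might look like the sketch below. It assumes each article has already been split into tokens (Chinese text would first need a word segmenter, which is not shown), and the function name `window_term_frequencies`, the stopword list, and the sample articles are hypothetical.

```python
from collections import Counter


def window_term_frequencies(articles, start, end, stopwords=frozenset(), top_n=10):
    """Count terms across articles whose day falls within [start, end].

    articles: iterable of (day, tokens) pairs, where tokens is a list of strings.
    Returns the top_n (term, count) pairs, most frequent first.
    """
    counter = Counter()
    for day, tokens in articles:
        if start <= day <= end:
            lowered = (token.lower() for token in tokens)
            counter.update(t for t in lowered if t not in stopwords)
    return counter.most_common(top_n)


# Made-up, pre-tokenized articles keyed by day of the month (illustration only).
example_articles = [
    (11, ["lung", "cancer", "and", "air", "pollution"]),
    (12, ["colorectal", "cancer", "and", "diet"]),
    (13, ["gastric", "cancer", "and", "painkillers"]),
    (20, ["shareholder", "meeting"]),
]
example_terms = window_term_frequencies(
    example_articles, start=11, end=13, stopwords=frozenset({"and"}), top_n=5
)
print(example_terms)
```

The most frequent term here is the broad word "cancer", while the specific cancer types and lifestyle factors each appear once, which mirrors the distinction between general and precise vocabulary discussed in this analysis.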

In the second time window, news reports focused on surgical treatment methods and progress in the development of new anti-cancer drugs. The reports mentioned a large number of company names, which indicates that the coverage in this period was closely tied to companies' first-quarter financial reports and annual shareholder meetings.
在第二的時間視窗的區間中,新聞報導則集中於手術治療方法與抗癌新藥開發進展。新聞報導大量提及公司名稱,顯示這個時期的新聞報導與企業第一季財務報告及年度股東會之間有著高度關聯。

In the third and most recent time window, a series of reports followed the death of a celebrity performer from cancer. As a result, precise disease terms such as lung adenocarcinoma and chronic obstructive pulmonary disease (COPD) appeared. In that social climate, the media also published a large number of informational reports on cancer prevention and cancer treatment.
在第三個、也是最近的一個時間視窗區間中,則出現了演藝名人罹癌過世而引發的一系列報導,也因此出現了如肺腺癌或慢性阻塞性肺病這樣的精確疾病詞彙。在相對應的社會氛圍下,媒體也出現大量關於癌症預防與癌症治療的相關知識性報導。

Summarizing the statistics and raw-data observations above, we find that homogeneous news reports appear in large numbers within a very short period. This produces a statistical peak that can serve as a threshold detection condition in an early warning system. Word frequency analysis can be used to distinguish general vocabulary from precise vocabulary across different types of reports, which gives the early warning system a chance to analyze content more accurately. Combined with data from online social media, it also shows whether news reports have attracted widespread public attention and response. In this example analysis, oncology news did not trigger a public response in any of the three reporting peaks mentioned above (the orange line in the line chart represents online social media). However, oncology reporting is by nature oriented toward public education. Even though no corresponding rise appears in the social media figures, concentrated, high-volume news coverage is enough to reach a large number of readers. If commercially meaningful messages are embedded in such a reporting peak, or if the peak is related to a specific organization, we believe this early warning system can demonstrate its practical value.
總結上述的統計數值與原始資料觀察,我們可以發現同質性的新聞報導會在極短的時間內大量出現,因此產生統計數值的高峰並且可作為預警系統的閾值偵測條件。詞頻分析則可用來分析不同類型報導之間的廣泛性詞彙與精確性詞彙,使預警系統有機會獲得更精確的內容分析能力。配合網路社群媒體的數據偵測,則可了解新聞報導是否廣泛引起民眾注意與回應。在這次的範例分析中,我們可以發現癌症醫學性質的新聞,在上述三個報導高峰中,都沒有引發民眾的回應(折線圖中的橘色線條表示網路社群媒體)。然而癌症醫學報導原本就偏重於社會教育,雖然沒有在網路社群媒體的反映出數值,集中而高篇幅的新聞報導已經足以獲得相當高的讀者觸及人數。倘若這種報導高峰被置入具有商業意義的訊息,或者與特定機構有關連性,我們認為本預警系統就能展現出應用價值。
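
The threshold condition and the social media comparison can be sketched roughly as follows. This is a minimal illustration under our own assumptions: a day is flagged when its count exceeds the mean plus `k` standard deviations of the preceding `baseline_days` days, and the public response is measured as the ratio of that day's social media count to its own baseline. The function names, parameter values, and sample numbers are all hypothetical.

```python
from statistics import mean, stdev


def detect_peaks(counts, baseline_days=7, k=3.0):
    """Return indices of days whose count exceeds mean + k * stdev
    of the preceding baseline_days days."""
    flagged = []
    for i in range(baseline_days, len(counts)):
        baseline = counts[i - baseline_days:i]
        threshold = mean(baseline) + k * stdev(baseline)
        if counts[i] > threshold:
            flagged.append(i)
    return flagged


def social_response_ratio(social_counts, peak_index, baseline_days=7):
    """Ratio of the social media count on the peak day to its preceding baseline mean."""
    start = max(0, peak_index - baseline_days)
    baseline = social_counts[start:peak_index]
    if not baseline or mean(baseline) == 0:
        return float("inf")  # no usable baseline to compare against
    return social_counts[peak_index] / mean(baseline)


# Made-up daily counts (illustration only): news spikes on day 7, social media stays flat.
example_news = [10, 12, 11, 9, 10, 11, 10, 50, 55, 48, 10, 11]
example_social = [100, 98, 102, 101, 99, 100, 100, 101, 99, 100, 100, 102]

example_peaks = detect_peaks(example_news)
for day in example_peaks:
    ratio = social_response_ratio(example_social, day)
    print(f"news peak on day {day}, social response ratio {ratio:.2f}")
```

In this sketch only the first day of the spike is flagged, because the following days are compared against a baseline that already contains the spike; for an early warning that is usually the day that matters. A ratio close to 1 corresponds to the situation observed in this case study, where the news peak was not accompanied by a rise in social media activity.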

If you need any analysis of news reporting status, please feel free to contact us.
如果您有任何新聞報導相關的狀態分析需求,歡迎與我們聯繫。