Clustering of News in Publications

Sülün, Erhan

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.11779/1190

Title:	Clustering of News in Publications
Other Titles:	Yayınlarda yer alan makalelerin gruplanması
Authors:	Sülün, Erhan
Advisors:	Arısoy Saraçlar, Ebru
Publisher:	MEF Üniversitesi, Fen Bilimleri Enstitüsü
Source:	Sülün, E. (2018). Clustering of news in publications, MEF Üniversitesi Fen Bilimleri Enstitüsü, İstanbul, Türkiye
Abstract:	Large volumes of text are continuously produced and stored with the help of computer systems and the Internet, and the Internet also makes this text accessible to everyone. Given its size, however, it is very hard for people to analyze the data and discover the meaningful information in it. Machine learning techniques and computing power address this problem by analyzing the data and extracting meaningful information, giving people access to summarized information. The first step in analyzing text data is to represent it in a numerical format, since machine learning techniques accept only numerical inputs. Several representation methods exist, such as TF-IDF (Term Frequency - Inverse Document Frequency), Bag of Words, Word2Vec and Doc2Vec. The second step is to apply a machine learning algorithm that takes the numerical representation of the text as input. Machine learning techniques are either supervised or unsupervised, and the choice between them depends on the structure of the problem and the data. In this study, news documents published by publications in the United States, such as the New York Times, Reuters and the Washington Post, will be clustered into topics in order to categorize them and make them easier to investigate. Three data representation methods will be examined in detail and used: Bag of Words, TF-IDF and Doc2Vec. Finally, because the news data is an unlabeled set of documents, the K-Means clustering algorithm, an unsupervised learning technique, will be used with both the Euclidean distance and cosine similarity metrics. Categorization will be performed multiple times with different category counts, that is, with different K values, and the most meaningful category count will be determined by examining the clustering results.
Abstract (translated from the Turkish):	In today's world, with the support of computer technologies and the Internet, a very high volume of text data is regularly produced and stored. Thanks to the Internet, this high-volume text data is also accessible to everyone. However, given the volume of the data, it is very difficult for people to analyze it and extract meaningful information from it. At exactly this point, machine learning and computing power come into play, analyzing the data and identifying meaningful information to help people reach summarized information. Because machine learning techniques work only with numerical input, the first step in text analysis is converting the text into a numerical representation. Some of the methods developed for the numerical representation of text are 'TF-IDF (Term Frequency - Inverse Document Frequency)', 'Bag of Words', 'Word2Vec' and 'Doc2Vec'. The second step is applying the machine learning method that takes the resulting numerical representation as input. Machine learning methods are divided into supervised and unsupervised, and the type of learning to use is decided by considering the structure of the problem and of the input data. This study aims to separate news texts published in outlets in the United States, such as the New York Times, Reuters and the Washington Post, into topic groups, and thereby to categorize them and make the related news easier to investigate. Bag of Words, TF-IDF and Doc2Vec will be examined in detail and used as text representation methods. Finally, because the texts to be grouped are an unlabeled set, the grouping will be performed with the 'K-Means' algorithm, an unsupervised machine learning method, using the Euclidean Distance and Cosine Similarity metrics. The grouping will be performed multiple times with different numbers of categories, that is, with different K values, and the most meaningful number of categories will be determined by examining the grouping results.
URI:	https://hdl.handle.net/20.500.11779/1190
Appears in Collections:	FBE, Yüksek Lisans, Proje Koleksiyonu
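
Illustrative sketch (not taken from the project): the pipeline outlined in the abstract, Bag of Words and TF-IDF representations followed by K-Means with Euclidean distance or cosine similarity, can be sketched in plain Python as below. Everything here is an assumption made for illustration: the toy documents, the function names, the idf variant log(N / df), the deterministic farthest-first initialization, and the handling of cosine similarity by scaling vectors and centroids to unit length. Doc2Vec is left out because it requires a trained model.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            vec[index[tok]] = float(count)
        vectors.append(vec)
    return vocab, vectors


def tf_idf(count_vectors):
    """Weight raw counts by idf = log(N / df), one common TF-IDF variant."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    df = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    return [[vec[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for vec in count_vectors]


def unit_length(vec):
    """Scale a vector to Euclidean norm 1 (all-zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, cosine=False, iterations=50):
    """Lloyd's algorithm with farthest-first initialization (an assumption).

    With cosine=True, vectors and centroids are kept at unit length, so the
    nearest centroid by Euclidean distance is also the one with the highest
    cosine similarity.
    """
    if cosine:
        vectors = [unit_length(v) for v in vectors]
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_euclidean(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = unit_length(mean) if cosine else mean
    return labels


# Toy stand-ins for news articles (hypothetical, not the project's data).
docs = [
    "bank raises interest rates",
    "interest rates lift bank stocks",
    "investors watch bank stocks",
    "team wins championship game",
    "coach praises team after game",
    "fans celebrate team championship",
]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
for k in (2, 3):  # repeat the clustering with different K values
    print(k, "euclidean:", k_means(weights, k))
    print(k, "cosine:   ", k_means(weights, k, cosine=True))
```

Comparing the label lists across K values and across the two metrics mirrors, on a very small scale, the kind of inspection the abstract describes for choosing a category count. A library implementation would normally be preferred for real data.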