News media profiling (reliability, political bias, factual reporting, etc.) is a crucial first step for fact-checking systems and a primary consideration for journalists when manually verifying the trustworthiness of information. Although the AI community has reported many advances on short-term news media bias at the article level (i.e., based on content), long-term bias descriptors at the media profiling level have received little attention. To address this gap, and with the aim of providing automatic support for news media producers and an awareness tool for fact-checking systems, we focused our research on the following question: “to what extent can we profile a news media source solely based on its interactions with other sources on the Web?”
To answer this question, we built a news media graph from the Web and then applied four reinforcement learning strategies to infer three descriptor values: reliability, political bias and factual reporting. More precisely, we first construct a weighted directed graph. Given two news websites, newsA.com (A) and newsB.info (B), source A is connected to B if A contains articles that hyperlink to B. The weight of the edge from A to B is a value between 0 and 1: the proportion of all hyperlinks in A that point to B. We hypothesize that certain properties of a news source A can be estimated from the sources it interacts with, by inheriting their properties; e.g., sources interacting primarily with reliable (or unreliable) sources may have a higher chance of being reliable (or unreliable) too.
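The graph construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the paper: the function name `build_graph`, the input format (one `(source, target)` pair per hyperlink) and the choice to ignore self-links are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_graph(hyperlinks):
    """Build a weighted directed graph from (source, target) hyperlink pairs.

    The weight of edge A -> B is the share of A's outgoing hyperlinks that
    point to B, so the weights leaving each source sum to 1.
    """
    counts = defaultdict(Counter)
    for source, target in hyperlinks:
        if source != target:  # assumption: links to the same site are ignored
            counts[source][target] += 1

    graph = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        graph[source] = {target: n / total for target, n in targets.items()}
    return graph


# Hypothetical toy input: newsA.com links three times to newsB.info
# and once to newsC.org (a made-up third site).
links = [("newsA.com", "newsB.info")] * 3 + [("newsA.com", "newsC.org")]
graph = build_graph(links)
print(graph["newsA.com"])
```

In this toy case the edge from newsA.com to newsB.info gets weight 0.75 and the edge to newsC.org gets 0.25.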
In the figure to the right, colors illustrate the target property, whose value is known in advance for some sources. These known values are converted to rewards, which are positive (green) for known reliable sources and negative (red) for known unreliable ones. The edges represent the probability of reaching B from A, computed as the proportion of hyperlinks in A that point to B and drawn as the arrow’s thickness.
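To make the idea of inheriting rewards through the edges concrete, here is a small sketch that treats the graph as a Markov reward process and repeatedly applies the update V(A) = R(A) + γ · Σ P(A→B) · V(B). It is a generic textbook-style iteration chosen for illustration, not one of the four strategies from the paper; the function name `propagate_rewards`, the discount `gamma`, the reward values of +1 and -1, and the third site newsC.org are all assumptions.

```python
def propagate_rewards(graph, rewards, gamma=0.5, iterations=50):
    """Estimate a value for every source from the rewards of the sources it links to.

    graph:   {source: {target: probability of reaching target from source}}
    rewards: {source: reward}, given only for sources with a known label;
             sources without a known label get reward 0 (an assumption).
    """
    nodes = set(graph) | set(rewards)
    for targets in graph.values():
        nodes.update(targets)

    values = {node: 0.0 for node in nodes}
    for _ in range(iterations):
        # Synchronous update: every new value is computed from the previous values.
        values = {
            node: rewards.get(node, 0.0)
            + gamma * sum(p * values[t] for t, p in graph.get(node, {}).items())
            for node in nodes
        }
    return values


# Hypothetical toy graph: newsA.com is unlabeled and links mostly to a
# known reliable source (+1) and occasionally to a known unreliable one (-1).
toy_graph = {"newsA.com": {"newsB.info": 0.75, "newsC.org": 0.25}}
toy_rewards = {"newsB.info": 1.0, "newsC.org": -1.0}
values = propagate_rewards(toy_graph, toy_rewards)
print(values["newsA.com"])
```

Because newsA.com links mostly to the reliable source, its estimated value ends up positive, which is the "birds of a feather" intuition in miniature.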
For reliability, the reinforcement learning techniques used to compute the rewards are explained in detail in our accepted paper, “Reliability Estimation of News Media Sources: Birds of a Feather Flock Together”, which can be downloaded from the CRITERIA publication portal.
Our results revealed high performance when inferring the reliable class, and we consistently observed that identifying unreliable sources is more challenging. Moreover, we assessed the quality of the estimated values by measuring their correlation with reliability scores provided by journalists.
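As an illustration of what such a correlation check can look like, the sketch below computes a Spearman rank correlation between estimated values and reference scores. The choice of Spearman is an assumption made for this example (the post does not say which correlation measure was used), and the numbers are invented toy values, not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented toy values: estimated reliability vs. journalist-provided scores.
estimated = [0.10, 0.40, 0.35, 0.90]
journalist = [20, 60, 55, 95]
rho = spearman(estimated, journalist)
print(rho)
```

A value of `rho` close to 1 would mean the estimated values order the sources in nearly the same way as the journalists' scores.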
The overall contributions of our work are as follows:
- We propose to model source reliability estimation as a reliability degree (i.e., a continuous value) rather than a categorical value, in an independent, scalable real-world scenario;
- We implement various algorithms to estimate reliability scores, exploring a spectrum from vanilla reinforcement learning strategies to task-specific variations;
- We build the largest news media reliability dataset available;
- We provide empirical evidence demonstrating that predicting a news media source’s reliability based solely on its interactions with other sources is feasible.
We look forward to presenting our work as an oral presentation on June 19th in Mexico City at the conference of the North American Chapter of the Association for Computational Linguistics (NAACL), one of the top NLP conferences.
Last but not least, we successfully validated our findings by building on the work described in our previous blog post, Towards Factuality Assessment, where we described the Claim-Dissector, a tool that computes a verdict (Support, Reject or Not Enough Information) for a given claim using textual evidence. The enhanced Claim-Dissector now uses the News Media Reliability score as a relevant feature to compute a more accurate verdict. Furthermore, we are currently exploring the extension to other media profiling descriptors, including Factual Reporting and Political Bias.
This work contributes to the CRITERIA project, which aims to identify risk interactions and patterns of bias propagation in news media in the context of migration.
Dairazalia Sanchez-Cortes, PhD
Dairazalia Sanchez-Cortes is currently working as a Postdoctoral Researcher at the Idiap Research Institute in Switzerland. She joined the Speech and Audio Processing Group in 2023. She holds a PhD in Sciences from the EPFL’s School of Engineering. Her research interests include machine learning, human activity modeling, nonverbal behavior and applied research.
Sergio Burdisso, PhD
Sergio Burdisso is currently a Postdoctoral Researcher at the Idiap Research Institute in Switzerland. He is actively collaborating with Dr. Petr Motlicek in the Speech and Audio Processing Group. Sergio holds a PhD in Computer Science, specialized in Natural Language Processing applied to Early Risk Identification on Social Media. His main research interests include interpretable machine learning, few-shot learning, and representation learning for dialogue modeling.
Dr. Petr Motlicek
Petr Motlicek has been a research scientist in the Speech and Audio Processing Group at the Idiap Research Institute in Switzerland since 2005. His research activities focus on audio and speech processing technologies (voice coding and recognition, and speaker recognition), conversation analysis and machine learning. Many of the designed applications are developed in collaboration with security/government (LEA) bodies in Switzerland or at the EU level. He has significantly contributed to Kaldi, open-source software developed for speech and speaker recognition tasks, with many new libraries for signal processing being provided by Idiap.
Sources:
- Sergio Burdisso, Dairazalia Sánchez-Cortés, Esaú Villatoro-Tello, Petr Motlicek. “Reliability Estimation of News Media Sources: Birds of a Feather Flock Together” In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024. Mexico City, Mexico. Association for Computational Linguistics. https://doi.org/10.48550/arXiv.2404.09565