German News Article Topic Extraction Dataset-2011-whenamancodes

Data source: publicly available internet data
Tags: NLP, German, news articles, topic classification, dataset, academic research, machine learning, journalism

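Data overview: This dataset contains 10,273 German-language news articles from an Austrian online newspaper, classified into nine topics. The articles are the part of the One Million Posts Corpus that had not previously been used. Each article in the corpus has a topic path, for example "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise". The 10kGNAD dataset uses the second part of the topic path, such as "Wirtschaft", as the class label. Each article's title and text are merged into a single text, and author information is removed so that classifiers cannot rely on author names.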
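
The labeling rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the dataset authors' preprocessing code: the function names `label_from_topic_path` and `merge_title_and_body` are hypothetical, and joining title and text with a single space is an assumption, since the description does not say which separator was used.

```python
def label_from_topic_path(path: str) -> str:
    """Return the second component of a slash-separated topic path."""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"topic path has no second component: {path!r}")
    return parts[1]


def merge_title_and_body(title: str, body: str) -> str:
    """Join title and body into one text field (separator is an assumption)."""
    return f"{title.strip()} {body.strip()}".strip()


# Topic path taken from the dataset description.
example_path = "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
print(label_from_topic_path(example_path))  # Wirtschaft
```

Filtering out empty components makes the function tolerant of a leading or trailing slash, which may or may not occur in the real corpus.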

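Intended use: The dataset targets German text classification, specifically topic classification. Researchers can use it to train and evaluate text classifiers and so improve German text-processing tools and models. It can also serve as a benchmark for German topic classification, so that results from different researchers can be compared and verified. It is suited to natural language processing (NLP) teaching, text classification research, and journalism analysis.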
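
When the dataset is used as a benchmark, results are only comparable if everyone computes the same metrics. The sketch below shows one common choice, accuracy and macro-averaged F1, in plain Python. It assumes gold and predicted labels are available as two equal-length lists; the function names are hypothetical, and apart from "Wirtschaft" the label strings in the example are placeholders rather than confirmed category names.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of articles whose predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[g] += 1  # gold class gets a false negative
    scores = []
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder labels for illustration only.
gold = ["Wirtschaft", "Sport", "Sport", "Web"]
pred = ["Wirtschaft", "Sport", "Web", "Web"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Macro averaging gives every topic equal weight regardless of how many articles it has, which matters if the nine classes are unevenly sized.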

Citation:

@InProceedings{Schabus2017,
  Author    = {Dietmar Schabus and Marcin Skowron and Martin Trapp},
  Title     = {One Million Posts: A Data Set of German Online Discussions},
  Booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
  Pages     = {1241--1244},
  Year      = {2017},
  Address   = {Tokyo, Japan},
  Doi       = {10.1145/3077136.3080711},
  Month     = aug
}

@InProceedings{Schabus2018,
  author    = {Dietmar Schabus and Marcin Skowron},
  title     = {Academic-Industrial Perspective on the Development and Deployment of a Moderation System for a Newspaper Website},
  booktitle = {Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC)},
  year      = {2018},
  address   = {Miyazaki, Japan},
  month     = may,
  pages     = {1602--1605},
  abstract  = {This paper describes an approach and our experiences from the development, deployment and usability testing of a Natural Language Processing (NLP) and Information Retrieval system that supports the moderation of user comments on a large newspaper website. We highlight some of the differences between industry-oriented and academic research settings and their influence on the decisions made in the data collection and annotation processes, selection of document representation and machine learning methods. We report on classification results, where the problems to solve and the data to work with come from a commercial enterprise. In this context typical for NLP research, we discuss relevant industrial aspects. We believe that the challenges faced as well as the solutions proposed for addressing them can provide insights to others working in a similar setting.},
  url       = {http://www.lrec-conf.org/proceedings/lrec2018/summaries/8885.html},
}

Additional information

Field
Version: 1.0
Dataset size: 121.39 MiB
Last updated: May 6, 2025
Created: May 6, 2025
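Notice: Part of the source data for this dataset comes from the public internet. If any content infringes your rights, please contact us within 24 hours to have it removed (400-600-6816).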