
Natural Language Processing

A bilingual (English - Arabic) glossary for NLP terminology

Moroccan Darija Datasets

This list gathers Moroccan Darija datasets in one place, so you do not have to track each one down separately. For every dataset we give the name, data source, region, and size, to help you choose the one best suited to the task at hand.

| # | Dataset | Data source | Region | Size | Link | Year | Reference |
|---|---------|-------------|--------|------|------|------|-----------|
| 1 | Moroccan Arabic Sentiment Analysis Corpus | Twitter | Maghrebi (Moroccan) | 2,000 entries | source | 2018 | [1] |
| 2 | IADD: An integrated Arabic dialect identification dataset | Varied | Maghrebi, Levantine, Egyptian and Gulf | 135,804 texts | source | 2022 | [2] |
| 3 | Dialectal Arabic Datasets | Twitter | Maghrebi, Levantine, Egyptian and Gulf | 350 tweets per region | source | 2018 | [3] |
| 4 | MSDA Open Datasets | Social media posts | Arabic | - | source | 2020 | [4] |
| 5 | Moroccan Dialect Darija Open Dataset | Open source contributions | Maghrebi (Moroccan) | More than 13K | source | 2021 | [5] |
| 6 | Goud.ma: a News Dataset for Summarization in Moroccan Darija | goud.ma | Maghrebi (Moroccan) | 158k news articles | source | 2022 | [6] |
| 7 | MNAD: Moroccan News Articles Dataset | Moroccan news websites | Maghrebi (Moroccan) | 418,563 documents | source | 2021 | [7] |
| 8 | QADI: QCRI Arabic Dialect Identification | Twitter | Maghrebi, Levantine, Egyptian and Gulf | 540k tweets | source | 2020 | [8] |
| 9 | Dvoice: An open source dataset for Automatic Speech Recognition on Moroccan dialectal Arabic | Voice recordings + text transcriptions | Maghrebi (Moroccan) | 2,392 training and 600 testing files | source | 2021 | [9] |
| 10 | ASAYAR: A Dataset for Arabic-Latin Scene Text Localization in Highway Traffic Panels | Images collected on different Moroccan highways, annotated manually | Maghrebi (Moroccan) | 1,763 images | source | 2020 | [10] |
| 11 | OMCD: Offensive Moroccan Comments Dataset | A collection of comments from YouTube that have been labeled for offensive content | Maghrebi (Moroccan) | 8,024 comments written in Moroccan dialect | source | 2023 | [11] |
| 12 | MORED: A Moroccan Buildings' Electricity Consumption Dataset | A dataset that comprises electricity consumption data of various Moroccan premises | Maghrebi (Moroccan) | - | source | 2020 | [12] |
| 13 | DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect | DarNERcorp is a manually annotated corpus for Named Entity Recognition (NER) in the Moroccan dialect, or Darija | Maghrebi (Moroccan) | 65,905 tokens | source | 2023 | [13] |
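
The grouping by data source and region can also be used programmatically to shortlist candidates. The following is a minimal sketch, not part of the original catalogue: the `DatasetEntry` record, the `CATALOGUE` list, and the `select` helper are hypothetical names introduced for illustration, and only four rows of the table are transcribed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DatasetEntry:
    """One catalogue row (hypothetical structure, for illustration only)."""
    name: str
    data_source: str
    regions: Tuple[str, ...]
    year: int


# A few rows transcribed from the table above; the rest are omitted for brevity.
CATALOGUE = [
    DatasetEntry("Moroccan Arabic Sentiment Analysis Corpus", "Twitter",
                 ("Maghrebi",), 2018),
    DatasetEntry("QADI: QCRI Arabic Dialect Identification", "Twitter",
                 ("Maghrebi", "Levantine", "Egyptian", "Gulf"), 2020),
    DatasetEntry("Goud.ma: a News Dataset for Summarization in Moroccan Darija",
                 "goud.ma", ("Maghrebi",), 2022),
    DatasetEntry("OMCD: Offensive Moroccan Comments Dataset", "YouTube",
                 ("Maghrebi",), 2023),
]


def select(entries: List[DatasetEntry],
           region: Optional[str] = None,
           data_source: Optional[str] = None,
           single_region: bool = False) -> List[DatasetEntry]:
    """Return the entries that match every filter that was supplied."""
    result = []
    for entry in entries:
        # Keep only entries that cover the requested region.
        if region is not None and region not in entry.regions:
            continue
        # Compare data sources case-insensitively.
        if data_source is not None and entry.data_source.lower() != data_source.lower():
            continue
        # Optionally drop multi-dialect datasets.
        if single_region and len(entry.regions) != 1:
            continue
        result.append(entry)
    return result


if __name__ == "__main__":
    for entry in select(CATALOGUE, region="Maghrebi", single_region=True):
        print(entry.year, entry.name)
```

Passing `single_region=True` keeps only the datasets dedicated to Moroccan Darija, which is usually the first cut when a task does not need the multi-dialect corpora.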