Moroccan Darija Datasets
This benchmark gathers Moroccan Darija datasets in one place so that they can be found without searching for each one individually. For each dataset we list its name, data source, region, and size, so that you can pick the one best suited to the task at hand.
# | Dataset | Data source | Region | Size | Link | Reference |
---|---|---|---|---|---|---|
1 | Moroccan Arabic Sentiment Analysis Corpus | - | Maghrebi (Moroccan) | 2000 entries | source | 2018 [1] |
2 | IADD: An integrated Arabic dialect identification dataset | Varied | Maghrebi, Levantine, Egyptian and Gulf | 135,804 texts | source | 2022 [2] |
3 | Dialectal Arabic Datasets | - | Maghrebi, Levantine, Egyptian and Gulf | 350 tweets per region | source | 2018 [3] |
4 | MSDA Open Datasets | Social media posts | Arabic | - | source | 2020 [4] |
5 | Moroccan Dialect Darija Open Dataset | Open source contributions | Maghrebi (Moroccan) | More than 13K | source | 2021 [5] |
6 | Goud.ma: a News Dataset for Summarization in Moroccan Darija | goud.ma | Maghrebi (Moroccan) | 158k news articles | source | 2022 [6] |
7 | MNAD: Moroccan News Articles Dataset | Moroccan news websites | Maghrebi (Moroccan) | 418,563 documents | source | 2021 [7] |
8 | QADI: QCRI Arabic Dialect Identification | - | Maghrebi, Levantine, Egyptian and Gulf | 540k tweets | source | 2020 [8] |
9 | Dvoice: An open source dataset for Automatic Speech Recognition on Moroccan dialectal Arabic | Voice recordings + text transcriptions | Maghrebi (Moroccan) | 2392 training and 600 testing files | source | 2021 [9] |
10 | ASAYAR: A Dataset for Arabic-Latin Scene Text Localization in Highway Traffic Panels | Images collected on different Moroccan highways, annotated manually. | Maghrebi (Moroccan) | 1763 images | source | 2020 [10] |
11 | OMCD: Offensive Moroccan Comments Dataset | A collection of comments from YouTube that have been labeled for offensive content. | Maghrebi (Moroccan) | 8024 comments written in Moroccan dialect | source | 2023 [11] |
12 | MORED: A Moroccan Buildings’ Electricity Consumption Dataset | A dataset that comprises electricity consumption data of various Moroccan premises | Maghrebi (Moroccan) | - | source | 2020 [12] |
13 | DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect | DarNERcorp is a manually annotated corpus for Named Entity Recognition (NER) in the Moroccan Dialect or Darija | Maghrebi (Moroccan) | 65,905 tokens | source | 2023 [13] |
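
Since the table is meant to help pick a dataset by region and size, the selection step can be sketched in a few lines of Python. This is only an illustration: the `DatasetEntry` class, the `CATALOG` list, and the `select_by_region` helper are hypothetical names introduced here, not part of any of the listed datasets, and the four entries are copied by hand from the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the table (hypothetical in-memory representation)."""
    name: str
    region: str
    size: str
    year: int


# A few rows copied by hand from the table; "-" marks an unknown size.
CATALOG = [
    DatasetEntry("Goud.ma: a News Dataset for Summarization in Moroccan Darija",
                 "Maghrebi (Moroccan)", "158k news articles", 2022),
    DatasetEntry("IADD: An integrated Arabic dialect identification dataset",
                 "Maghrebi, Levantine, Egyptian and Gulf", "135,804 texts", 2022),
    DatasetEntry("OMCD: Offensive Moroccan Comments Dataset",
                 "Maghrebi (Moroccan)", "8024 comments written in Moroccan dialect", 2023),
    DatasetEntry("MSDA Open Datasets", "Arabic", "-", 2020),
]


def select_by_region(catalog, keyword):
    """Return the entries whose region contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [entry for entry in catalog if needle in entry.region.lower()]


# Keep only the datasets explicitly tagged as Moroccan.
moroccan_only = select_by_region(CATALOG, "Moroccan")
for entry in moroccan_only:
    print(f"{entry.name} ({entry.year}): {entry.size}")
```

Matching on a substring of the region keeps multi-region datasets such as IADD reachable when searching for "Maghrebi", while a search for "Moroccan" returns only the Moroccan-specific rows.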