Summary of Wikipedia data sources by format, update frequency, data quantity, type, and challenges
Source | Content | Access | Format | Update frequency | Data quantity* | Type** | Main challenge*** |
---|---|---|---|---|---|---|---|
Wikimedia Dumps | Metadata, page content, and relationships | Offline | XML, SQL | Once/twice a month | Big data | General | Data processing |
MediaWiki and Wikimedia APIs | Metadata, page content, relationships, and statistics | Online | JSON, WDDX, XML, YAML, PHP | Real time | Small data | General | Data recovery |
Wiki Replicas | Metadata, page content, and relationships | Online | SQL | Near-real time | Small data | General | Data recovery |
Event Streams | Real-time logs | Online | SSE, JSON | Real time | – | Specific | Data recovery |
Analytics dumps | Statistics on page views and activity | Offline | TSV | Monthly | Big data | Specific | Data processing |
WikiStats | Statistics on page views, content, and activity | Online | JSON/CSV | Monthly | Small data | Specific | Data recovery |
DBpedia | Contents and semantic relationships | Both | RDF/XML, Turtle, N-Triples, SPARQL endpoint | Live/monthly | – | General | Data recovery |
XTools | Statistics on page views, content, and activity | Online | JSON | Real time | Small data | Specific | Data recovery |
Repositories | Dedicated Wikipedia data sets | Offline | – | – | – | – | – |
Altmetric aggregators | Wikipedia references to publications | Online | CSV/JSON | Daily | – | Specific | Data processing |
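
To make the "Online" and "JSON" entries of the MediaWiki API row concrete, the following is a minimal sketch of a request for basic page metadata. It assumes the public Action API endpoint of English Wikipedia and the standard `action=query` and `prop=info` parameters. The helper names `build_query_url` and `fetch_page_info` and the User-Agent string are illustrative only and do not come from the table.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the public Action API endpoint of English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title: str, endpoint: str = API_ENDPOINT) -> str:
    """Build an Action API URL that asks for basic page metadata as JSON."""
    params = {
        "action": "query",   # read-only query module
        "prop": "info",      # basic page metadata (page id, length, last touched)
        "titles": title,
        "format": "json",
    }
    return f"{endpoint}?{urlencode(params)}"


def fetch_page_info(title: str) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    request = Request(
        build_query_url(title),
        # Hypothetical User-Agent; replace with your own tool name and contact.
        headers={"User-Agent": "example-script/0.1 (contact@example.org)"},
    )
    with urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_page_info() to perform the request.
    print(build_query_url("Alan Turing"))
```

Keeping URL construction separate from the network call lets the request be inspected or logged before anything is sent, which is convenient when collecting many pages one request at a time.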