Get a list of all Wikipedia articles

The goal of this tutorial is to get a list of all Wikipedia article URLs.

Download a ZIM file of all of Wikipedia

You can download it from here, or check out "Where to download wikipedia?" for more info.

Install zim-tools

sudo apt-get install zim-tools

Use zim-tools to list all paths in the ZIM file


zimdump list  wikipedia_*.zim > list_wikipedia_articles.dump
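
Before filtering, it can help to see which path prefixes actually appear in the dump. The following is a minimal sketch, assuming each line of the dump is a path whose first component (the text before the first `/`) is a namespace such as `A`. The function name `count_prefixes` and the sample lines are hypothetical and only for illustration; they are not real zimdump output.

```python
from collections import Counter


def count_prefixes(lines):
    """Count the first path component (text before the first '/') of each line."""
    counts = Counter()
    for line in lines:
        path = line.rstrip("\n")
        if not path:
            continue  # skip empty lines
        prefix, sep, _rest = path.partition("/")
        # Lines without any '/' are grouped under a placeholder key
        counts[prefix if sep else "(no prefix)"] += 1
    return counts


# Hypothetical sample lines, standing in for the real dump
sample = ["A/Earth\n", "A/Moon\n", "I/Earth.jpg\n", "-/style.css\n"]
print(count_prefixes(sample).most_common())
```

To run it on the real dump, pass an open file object instead of the sample list, for example `count_prefixes(open("list_wikipedia_articles.dump", encoding="utf-8"))`. If no `A` prefix shows up, the filter in the next step will need a different prefix.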

Use a Python script to list only articles

# Define the input and output file names (point input_file_name at the dump created above)
input_file_name = '/home/dentropy/Projects/wikipedia-article-names/list_wikipedia_articles.dump'
output_file_name = 'articles_names.txt'

# Open both files as UTF-8, since article names can contain non-ASCII characters
with open(input_file_name, 'r', encoding='utf-8') as input_file, \
        open(output_file_name, 'w', encoding='utf-8') as output_file:
    for line in input_file:
        # Article entries are assumed to carry the 'A/' namespace prefix
        if line.startswith('A/'):
            # Drop the two-character 'A/' prefix and write the article name
            output_file.write(line[2:])
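
The script above produces article names, while the stated goal is a list of URLs. One way to bridge that gap is sketched below. It assumes the English Wikipedia base URL `https://en.wikipedia.org/wiki/` and that the names in `articles_names.txt` match the live article titles; neither is confirmed by the ZIM file itself. The helper name `article_url` is hypothetical.

```python
from urllib.parse import quote

# Assumption: English Wikipedia; change this for other language editions
BASE_URL = "https://en.wikipedia.org/wiki/"


def article_url(name, base_url=BASE_URL):
    """Build a URL from one line of articles_names.txt."""
    # Strip the trailing newline and use underscores, as Wikipedia URLs do
    title = name.strip().replace(" ", "_")
    # Percent-encode the title, leaving '/' alone since it can appear in titles
    return base_url + quote(title, safe="/")


print(article_url("Albert_Einstein\n"))
```

Applying `article_url` to every line of `articles_names.txt` and writing the results to a new file would give the URL list.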

Sources