Do all Wikipedia articles really lead to philosophy?

Using Wikimedia's API to explore its network of articles

Introduction

Rumour has it that all articles on Wikipedia eventually lead to Philosophy. The phenomenon even has its own Wikipedia article.

I thought it would be fun to put this theory to the test. Read my findings in the post below!

In this section, I explain how I used the Wikipedia API to get the first links for each page. You can skip directly to the results if that’s what you’re interested in.

Using Wikimedia’s API

To access Wikipedia page data, I used the Wikimedia API1. The API allowed me to query several Wikipedia pages and get the first link for each of them.

To interact with the API, I used the httr and jsonlite packages (to query the API and format the results, respectively).

The code below sets up the basis for querying the API, using the credentials I got from this page. This is not strictly necessary, but authenticated clients get a higher rate limit (5,000 requests per hour).

key <- Sys.getenv("API_KEY")
language <- "en"

url <- paste0("https://", language, ".wikipedia.org/w/api.php")
header <- add_headers("Authorization" = paste("Bearer", key))

Getting a subset of articles

First, I’m going to query a subset of 100 starting Wikipedia articles to test the “philosophy hypothesis”. I will also cap each chain at 50 downstream articles in case no loop is found (but as you’ll see later, this limit was never reached).

n_articles <- 100
chain_max <- 50

Now, let’s get our 100 random Wikipedia pages: for this, I perform a Query Action using the Random module.

# Get random Wikipedia articles
par <- list("action" = "query",
            "format" = "json",
            "list" = "random",
            "rnfilterredir" = "nonredirects",
            "rnnamespace" = 0,
            "rnlimit" = n_articles)

# Format results
res <- httr::GET(url, header,
                 query = par)
resj <- jsonlite::fromJSON(content(res, "text"),
                           flatten = TRUE)
random_pages <- resj$query$random

# Save results (because there is no seed in the query)
saveRDS(random_pages, file = file.path(basepath, "data", "pages.rds"))

I used httr::GET to construct the query string using my API credentials, but you could also construct the query from scratch (here, it would be https://en.wikipedia.org/w/api.php?action=query&format=json&list=random&rnfilterredir=nonredirects&rnnamespace=0&rnlimit=100). Then, I used jsonlite::fromJSON to format the results as a data.frame in R.
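As a quick illustration, the same query string can be assembled by hand with paste0 (no API call is made here; this just reproduces the URL shown above, with the parameter names from the MediaWiki Random module):

```r
# Build the "random pages" query URL manually, instead of
# letting httr::GET() encode the parameters for us
raw_url <- paste0(
  "https://en.wikipedia.org/w/api.php",
  "?action=query",
  "&format=json",
  "&list=random",
  "&rnfilterredir=nonredirects",
  "&rnnamespace=0",
  "&rnlimit=", 100
)
```

Letting httr handle the encoding is safer in general (it percent-encodes values for you), but seeing the raw URL makes it clear what is actually sent to the server.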

Here are a few of these random pages:

head(random_pages, n = 5)
        id ns                               title
1   564674  0      English football league system
2 15551349  0                            Stânceni
3 27639612  0 Guido de Bres Christian High School
4   164634  0                                Pune
5 44019719  0                            Invincea

This section covers the main part of the analysis: hopping from page to page through first links.

First, I write a get_first_link function to get the first link from the text of a Wikipedia article. A few subtleties (followed from here) are:

  • I get links from paragraphs or bullet lists only (p or ul elements, excluding tables), to avoid infoboxes and other decorations;
  • I exclude links between parentheses, to discard language links.

(Theoretically, I would also have to discard italicized links, but this case seems to be rare and unlikely to bias the results, so I didn’t.)
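To illustrate the parentheses rule, here is a small standalone check using the same regex idea as in the get_first_link function below, on a toy snippet (not real article HTML):

```r
# Toy paragraph: the first link is inside parentheses, the second is not
snippet <- 'Blue (<a href="/wiki/Colour">colour</a>) is a <a href="/wiki/Hue">hue</a>.'

links <- c('<a href="/wiki/Colour">colour</a>', '<a href="/wiki/Hue">hue</a>')

# A link is "parenthesized" if it appears between "(" and ")"
# with no other parentheses in between
is_parenthesized <- sapply(links, function(l) {
  grepl(paste0("\\([^()]*", l, "[^()]*\\)"), snippet, perl = TRUE)
})

# The first non-parenthesized link is the one we keep
first_link <- links[which(!is_parenthesized)[1]]
```

Here first_link is the “Hue” link: the “Colour” link is skipped because it sits inside parentheses. (Note that this toy check assumes the link string itself contains no regex metacharacters.)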

Code
#' Get first link
#'
#' Get first link from a Wikipedia article
#'
#' @param article_str String representation of the article (from parse query)
#' @param return_title Return the article title instead of the link?
#'
#' @returns If `return_title` is `TRUE`, returns the Wikipedia article title of the first link.
#' Else returns the first link of the text (in HTML format as `<a href="...">...</a>`)
#' @export
get_first_link <- function(article_str, return_title = TRUE) {
  # Parse to HTML
  article_html <- read_html(article_str)

  # Remove all tables (infoboxes)
  xml2::xml_remove(rvest::html_nodes(article_html, "table"))

  # Get all links pointing to a wiki page
  # (exclude special pages beginning with xxx:, e.g. Help:, Wikipedia:)

  # First try with paragraphs
  links <- article_html |>
    html_elements("p") |>
    html_elements("a") |>
    grep(pattern = "href=\"/wiki/(?![A-Za-z]+:)",
         perl = TRUE, value = TRUE)

  # If no luck, try with bullet lists
  if (length(links) == 0) {
    links <- article_html |>
      html_elements("ul") |>
      html_elements("a") |>
      grep(pattern = "href=\"/wiki/(?![A-Za-z]+:)",
           perl = TRUE, value = TRUE)
  }

  # Get first link
  res <- NULL
  for (l in links) {
    # Is the link parenthesized?
    # Match opening parenthesis, text, link, text, closing parenthesis
    # (where text is anything but parentheses)
    is_parenthesized <- grepl(pattern = paste0("\\([^()]*", l, "[^()]*\\)"),
                              x = article_html, perl = TRUE)
    if (!is_parenthesized) {
      # It's the first link outside parentheses
      res <- l
      break
    }
    # Else, continue
  }

  if (return_title) {
    # Get corresponding page title
    res <- gsub(".*href\\=\"/wiki/(\\S+)\".*", "\\1", res)
  }

  return(res)
}

Next, I iterate over each of the starting articles, following the first link, and the next, and the next… until either:

  • I end up in a loop (discovered in the current chain or in a previous one)
  • Or I reach the upper limit of links defined above (chain_max = 50)

To get each article’s first link, I use my custom get_first_link function on the article’s text (obtained through Parse Actions). Running the code below takes about 10 minutes on my machine.

# Initialize links list
all_links <- vector(mode = "list", length = n_articles)
unique_links <- c()

for (i in 1:n_articles) { # iterate over starting articles
  # Get starting article
  starting_page <- random_pages$title[i]

  message("Traversing links for article ",
          starting_page, " (", i, "/", n_articles,
          ") ====================")

  # Initialize list
  links_vec <- starting_page

  # Initialize search page
  page <- starting_page

  for (j in 1:chain_max) {
    message("Link #", j, " ---")
    # Get Wikipedia article body
    par <- list("action" = "parse",
                "page" = page,
                "format" = "json",
                "redirects" = "",
                "prop" = "text")

    res <- httr::GET(url, header,
                     query = par)
    resj <- jsonlite::fromJSON(content(res, "text"),
                               flatten = TRUE)

    # Extract article body
    article_str <- resj$parse$text$`*`

    # Get first link
    first_link <- get_first_link(article_str)
    # Replace underscores with spaces
    first_link <- gsub(pattern = "_", replacement = " ", first_link)
    # And decode URL for special characters (e.g. %E2%80%93)
    first_link <- URLdecode(first_link)

    if (first_link %in% links_vec) {
      message("Loop detected for '", starting_page, "' with '",
              first_link, "': exiting loop")
      # Store results before exiting
      links_vec <- c(links_vec, first_link)
      break
    } else if (first_link %in% unique_links) {
      message("Link ", first_link, " already detected: exiting loop")
      # Store results before exiting
      links_vec <- c(links_vec, first_link)
      break
    } else {
      message("First link: ", first_link)
      # Store results
      links_vec <- c(links_vec, first_link)
      # Update search page
      page <- first_link
    }
  }

  # Add the article's chain of links to the list of chains
  all_links[[i]] <- links_vec

  # Get new links from last chain
  new_links <- links_vec[1:(length(links_vec)-1)]
  new_links <- new_links[which(!(new_links %in% unique_links))]

  # Get unique links
  unique_links <- c(unique_links, new_links)
}

# Save results
saveRDS(all_links, file = file.path(basepath, "data", "links.rds"))
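The title cleanup in the loop above (turning an href slug back into a readable page title) can be checked on a couple of standalone examples, using page titles from the results as illustrations:

```r
# Underscores in the slug become spaces, as in the loop above
title <- gsub(pattern = "_", replacement = " ", x = "Meaning_(philosophy)")

# Percent-encoding is reversed with base R's URLdecode()
# (here %28 and %29 are the encoded parentheses)
decoded <- URLdecode("Convention_%28norm%29")
```

Both steps are needed because the hrefs use URL-encoded slugs, while the Parse Action expects human-readable page titles.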

Ultimately, this code produces a list of chains of links from article to article, where each chain stops when a loop is detected.

# See the first 3 link chains
head(all_links, 3)
[[1]]
 [1] "English football league system" "League system"                 
 [3] "Hierarchy"                      "Ancient Greek language"        
 [5] "Greek language"                 "Indo-European language"        
 [7] "Language family"                "Language"                      
 [9] "Communication"                  "Information"                   
[11] "Abstraction"                    "Rule of inference"             
[13] "Premise"                        "Proposition"                   
[15] "Meaning (philosophy)"           "Philosophy of language"        
[17] "Philosophy"                     "Existence"                     
[19] "Reality"                        "Everything"                    
[21] "Antithesis"                     "Proposition"                   

[[2]]
[1] "Stânceni"         "Mureș County"     "Romania"          "Southeast Europe"
[5] "Sub-region"       "Region"           "Geography"        "Ancient Greek"   
[9] "Greek language"  

[[3]]
 [1] "Guido de Bres Christian High School" "Hamilton, Ontario"                  
 [3] "Provinces and territories of Canada" "Canada"                             
 [5] "North America"                       "Continent"                          
 [7] "Convention (norm)"                   "Social norm"                        
 [9] "Acceptance"                          "Psychology"                         
[11] "Mind"                                "Thought"                            
[13] "Cognition"                           "Knowledge"                          
[15] "Declarative knowledge"               "Awareness"                          
[17] "Philosophy"                         

Results

And now, let’s get to the part we’ve all been waiting for: do all articles really lead to “Philosophy”?

First, I format the results to a network object using the igraph package.

Code
# Format output for network
nk_list <- lapply(all_links, function(l) {
  cbind(l[1:(length(l)-1)], l[2:length(l)])})
nk <- do.call("rbind", nk_list)

# Create graph
g <- igraph::graph_from_edgelist(nk, directed = TRUE)

The next step is to reconstruct the chain of links for each starting article. Because of the way I coded the query, some chains stop before reaching their loop (because the loop was already explored from another starting article), so I reconstruct the full loops in the code below.

Code
# Get all starting articles
starting_nodes <- sapply(all_links, function(l) l[1])

# Initialize list
complete_paths <- vector(mode = "list",
                         length = length(starting_nodes))

for (i in seq_along(starting_nodes)) {
  # Get all simple paths (excluding loops)
  simple_paths <- all_simple_paths(g, from = starting_nodes[i],
                                   mode = "out")
  # Get the longest
  longest_path_ind <- which.max(sapply(simple_paths, length))
  longest_path <- simple_paths[[longest_path_ind]]

  # Repeat the last vertex to know where the loop starts
  last_vertex <- longest_path[length(longest_path)]
  loop_vertex <- neighbors(g, last_vertex)

  # Create final path
  longest_path <- c(longest_path, loop_vertex)
  longest_path <- longest_path$name

  complete_paths[[i]] <- longest_path
}

As a first exploration of the results, let’s look at the length of the link chains, i.e. how many articles are visited before entering a loop.

Code
# Get mean path length (without loop)
before_loop <- lapply(complete_paths,
                      function(l) {
                        dup <- which(l == l[duplicated(l)])
                        l[1:min(dup)]
                      })

link_length <- sapply(before_loop, length)
mean_length <- mean(link_length)

ggplot() +
  geom_histogram(aes(x = link_length),
                 binwidth = 1,
                 fill = "#8FC4BD") +
  geom_vline(aes(xintercept = mean_length),
             linetype = "dashed", color = "grey50") +
  ylab("Count") +
  xlab("Article chain length") +
  scale_y_continuous(expand = expansion(mult = c(0, .1))) +
  theme_bw() +
  theme(plot.background = element_rect(fill = "transparent", color = NA),
        panel.background = element_rect(fill = "transparent",
                                        color = "grey50", linewidth = 1),
        panel.grid = element_blank(),
        text = element_text(color = "grey70"),
        axis.text = element_text(color = "grey70"),
        axis.ticks = element_line(color = "grey70"))
Distribution of chain lengths for the 100 random starting articles.

Here, the mean chain length is 17.14, and no chain exceeds 31 articles.

And for the long-awaited result: Where do articles end up?

Code
# Get the ending links for all articles
list_length <- sapply(complete_paths, length)
end_links <- sapply(seq_along(complete_paths),
                    function(i) complete_paths[[i]][list_length[i]])

end_links_df <- data.frame(article = names(table(end_links)),
                           prop = as.numeric(table(end_links))/n_articles*100)
end_links_df$article[end_links_df$article == "List of ecumenical patriarchs of Constantinople"] <- "List of ecumenical patriarchs\nof Constantinople"

end_links_df[end_links_df$article == "Philosophy", ]
     article prop
6 Philosophy   32
Code
ggplot(end_links_df, aes(x = prop,
                         y = reorder(article, prop))) +
  geom_col(fill = "#8FC4BD") +
  geom_text(aes(label = paste0(prop, "%")),
            color = "grey70",
            hjust = 0, nudge_x = 0.8) +
  xlab("Proportion (%)") +
  theme_void() +
  scale_x_continuous(expand = expansion(mult = c(0, .1))) +
  theme(axis.title.y = element_blank()) +
  theme(plot.background = element_rect(fill = "transparent", color = NA),
        text = element_text(color = "grey70"),
        axis.text.y = element_text(color = "grey70"),
        axis.text.x = element_blank(),
        axis.ticks = element_blank())
Distribution of ending articles. An article is considered the end if it eventually loops back on itself (and is the first article of the loop).

According to this graph, 32% of articles from my subset end up on “Philosophy”: that is, they reach a loop on the “Philosophy” article (which eventually loops back on itself). But most articles end up on “Meaning (philosophy)”, and a sizeable portion also end up on “Proposition”. So, was this Philosophy thing a bit of an oversell?

The graph below shows the full story: in fact, a vast majority of articles end up in a loop containing “Philosophy”. Blue dots show starting articles, and red dots ending articles. You can hover over articles to see their names.

Code
# Prepare data for plot ---
# Set node type (start, end or none)
vertices <- V(g)$name
ind_start <- match(starting_nodes, vertices)
ind_end <- match(end_links, vertices)

type <- rep("none", length(V(g)))
type[ind_start] <- "start"
type[ind_end] <- "end"

V(g)$type <- type

# Set degree attribute
V(g)$degree <- degree(g, mode = "in")

# Get mutual edges (to invert curvature)
mutual <- which_mutual(g)
curvature <- ifelse(mutual, 0.5, 0)

# Plot graph ---
lay <- create_layout(g, "stress",
                     bbox = 10)
gg <- ggraph(lay) +
  geom_edge_arc(strength = curvature, color = "grey70") +
  geom_point_interactive(aes(x = x, y = y, size = degree,
                             color = type, tooltip = name),
                         show.legend = FALSE) +
  scale_size(range = c(1, 3)) +
  scale_color_manual(values = c("start" = "cornflowerblue",
                                "none" = "grey70",
                                "end" = "darkred")) +
  theme_void() +
  theme(plot.margin = margin(t = 10, r = 10, b = 10, l = 10),
        plot.background = element_rect(fill = "transparent", color = NA))

girafe(ggobj = gg, bg = "transparent")
Wikipedia's network of first links, starting from 100 random pages (in blue). Red pages represent the end pages (before entering a loop).
# Number of chains containing "Philosophy"
n_philo <- sum(sapply(complete_paths, function(p) "Philosophy" %in% p))
n_philo
[1] 96

In my small example, 96% of articles reach the “Philosophy” loop (Wikipedia itself reported 97% in 2016). This loop is reached through three main paths; upon closer inspection of these articles, I’ll call them the “Psychology” branch (left), the “Logic” branch (top) and the “Science” branch (right).

Conclusion

This was a fun project that taught me how to interact with Wikimedia’s API and refreshed my network analysis skills.

But it also prompts other questions: for instance, what happens in other languages? Can we cluster articles by theme? How different are the results when we consider all links instead of only the first? Hopefully, I’ll be able to explore this in a future blog post!

Source code

You can view the source code for this blog post (embedded in a Quarto document). Check it out if you’d like to know more about the behind-the-scenes!


  1. The API is undergoing important changes in 2026, so apologies if any links in this post no longer work by the time you read this. ↩︎