Do all Wikipedia articles really lead to philosophy?
Using Wikimedia's API to explore its network of articles
Introduction
Rumour has it that all articles on Wikipedia eventually lead to Philosophy. This phenomenon even has its own Wikipedia article.
I thought it would be fun to put this theory to the test. Read my findings in the post below!
Getting first links
In this section, I explain how I used the Wikipedia API to get the first links for each page. You can skip directly to the results if that’s what you’re interested in.
Using Wikimedia’s API
To access Wikipedia page data, I used the Wikimedia API1. The API allowed me to query several Wikipedia pages and get the first link for each of them.
To interact with the API, I used the httr and jsonlite packages (to query the API and to format the results, respectively).
The code below sets up the basics for querying the API, using the credentials I got from this page. Credentials are not strictly necessary, but they get you a higher rate limit (5,000 requests per hour).
# Packages used in this post
library(httr)      # query the API
library(jsonlite)  # format the results
library(rvest)     # parse article HTML
library(igraph)    # build the link network
library(ggplot2)   # plots
library(ggraph)    # network plots
library(ggiraph)   # interactive plots

key <- Sys.getenv("API_KEY")
language <- "en"

url <- paste0("https://", language, ".wikipedia.org/w/api.php")
header <- add_headers("Authorization" = paste("Bearer", key))
Getting a subset of articles
First, I’m going to query a subset of 100 starting Wikipedia articles to test the “philosophy hypothesis”. I will also cap the search at 50 downstream articles in case no loop is found (but as you’ll see later, this limit was never reached).
n_articles <- 100
chain_max <- 50
Now, let’s get our 100 random Wikipedia pages: for this, I perform a Query Action using the Random module.
# Get random Wikipedia articles
par <- list("action" = "query",
            "format" = "json",
            "list" = "random",
            "rnfilterredir" = "nonredirects",
            "rnnamespace" = 0,
            "rnlimit" = n_articles)

# Format results
res <- httr::GET(url, header,
                 query = par)
resj <- jsonlite::fromJSON(content(res, "text"),
                           flatten = TRUE)
random_pages <- resj$query$random

# Save results (because there is no seed in the query)
saveRDS(random_pages, file = file.path(basepath, "data", "pages.rds"))
I used httr::GET to construct the query string with my API credentials, but you could also build the query by hand (here, it would be https://en.wikipedia.org/w/api.php?action=query&format=json&list=random&rnfilterredir=nonredirects&rnnamespace=0&rnlimit=100). I then used jsonlite::fromJSON to format the results as a data.frame in R.
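As a sanity check, here is a minimal base-R sketch (not part of the original pipeline) showing how that query string can be assembled by hand from the same parameter list:

```r
# Build the random-articles query string by hand (base R only).
# `par` mirrors the parameter list passed to httr::GET above.
par <- list(action = "query", format = "json", list = "random",
            rnfilterredir = "nonredirects", rnnamespace = 0, rnlimit = 100)

# Collapse name=value pairs with "&" and prepend the endpoint
query <- paste(names(par), unlist(par), sep = "=", collapse = "&")
full_url <- paste0("https://en.wikipedia.org/w/api.php?", query)
full_url
# "https://en.wikipedia.org/w/api.php?action=query&format=json&list=random&rnfilterredir=nonredirects&rnnamespace=0&rnlimit=100"
```

This is exactly the URL that httr::GET sends under the hood (modulo percent-encoding of special characters, which none of these values need).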
Here are a few of these random pages:
head(random_pages, n = 5)
id ns title
1 564674 0 English football league system
2 15551349 0 Stânceni
3 27639612 0 Guido de Bres Christian High School
4 164634 0 Pune
5 44019719 0 Invincea
Traversing links chains
This section performs the main part of our analysis: hopping from page to page through the first link.
First, I write a get_first_link function that extracts the first link from the text of a Wikipedia article. A few subtleties (following the rules described here) are:
- I only get links from paragraphs or bullet lists (p or ul elements, excluding tables), to avoid infoboxes and other decorations;
- I exclude links between parentheses, to discard language links.
(In theory, I should also discard italicized links, but this case seems rare enough not to bias the results, so I didn’t.)
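To illustrate the parenthesis rule, here is a minimal base-R sketch on a made-up snippet (the real implementation below works on the parsed HTML instead):

```r
# Hypothetical article snippet: the first link sits inside parentheses
html <- paste0('Romania (<a href="/wiki/Romanian_language">Romanian</a>) ',
               'is a <a href="/wiki/Country">country</a> in Europe.')

# Extract all wiki links, in order of appearance
links <- regmatches(html, gregexpr('<a href="/wiki/[^"]+">[^<]+</a>', html))[[1]]

# Keep the first link that is not wrapped in parentheses:
# match "(", then anything but parentheses, the link, more non-parentheses, ")"
first <- NULL
for (l in links) {
  parenthesized <- grepl(paste0("\\([^()]*", l, "[^()]*\\)"), html)
  if (!parenthesized) { first <- l; break }
}

# Extract the article title from the href attribute
first_title <- gsub('.*href="/wiki/([^"]+)".*', "\\1", first)
first_title
# "Country"
```

The parenthesized "Romanian" link is skipped, and the "Country" link is kept as the first valid link.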
Code
#' Get first link
#'
#' Get first link from a Wikipedia article
#'
#' @param article_str String representation of the article (from parse query)
#' @param return_title Return the article title instead of the link?
#'
#' @returns If `return_title` is `TRUE`, returns the Wikipedia article title of the first link.
#' Else returns the first link of the text (in HTML format as `<a href="...">...</a>`)
#' @export
get_first_link <- function(article_str, return_title = TRUE) {
  # Parse to HTML
  article_html <- read_html(article_str)

  # Remove all tables (infoboxes)
  xml2::xml_remove(rvest::html_nodes(article_html, "table"))

  # Get all links pointing to a wiki page
  # (Exclude special pages beginning with xxx: (e.g. Help:, Wikipedia:))

  # First try with paragraphs
  links <- article_html |>
    html_elements("p") |>
    html_elements("a") |>
    grep(pattern = "href=\"/wiki/(?![A-Za-z]+:)",
         perl = TRUE, value = TRUE)

  # If no luck, try with bullet lists
  if (length(links) == 0) {
    links <- article_html |>
      html_elements("ul") |>
      html_elements("a") |>
      grep(pattern = "href=\"/wiki/(?![A-Za-z]+:)",
           perl = TRUE, value = TRUE)
  }

  # Get first link
  res <- NULL
  for (l in links) {
    # Is the link parenthesized?
    # Match opening parenthesis, [text], link, [text], closing parenthesis
    # where [text] is anything but parentheses
    is_parenthesized <- grepl(pattern = paste0("\\([^()]*", l, "[^()]*\\)"),
                              x = article_html, perl = TRUE)
    if (!is_parenthesized) {
      # It's the first link outside parentheses
      res <- l
      break
    }
    # Else, continue
  }

  if (return_title) {
    # Get corresponding page title
    res <- gsub(".*href\\=\"/wiki/(\\S+)\".*", "\\1", res)
  }

  return(res)
}
Next, I iterate over each starting article, following the first link, then the next, and the next… until:
- I end up in a loop (discovered in the current chain or in a previously explored one);
- or I reach the upper limit on the number of links defined above (chain_max = 50).
To get each article’s first link, I apply my custom get_first_link function to the article’s text (obtained through Parse actions). Running the code below takes about 10 minutes on my setup.
# Initialize links list
all_links <- vector(mode = "list", length = n_articles)
unique_links <- c()

for (i in 1:n_articles) { # iterate over starting articles
  # Get starting article
  starting_page <- random_pages$title[i]

  message("Traversing links for article ",
          starting_page, " (", i, "/", n_articles,
          ") ====================")

  # Initialize links chain
  links_vec <- starting_page

  # Initialize search page
  page <- starting_page

  for (j in 1:chain_max) {
    message("Link #", j, " ---")
    # Get Wikipedia article body
    par <- list("action" = "parse",
                "page" = page,
                "format" = "json",
                "redirects" = "",
                "prop" = "text")

    res <- httr::GET(url, header,
                     query = par)
    resj <- jsonlite::fromJSON(content(res, "text"),
                               flatten = TRUE)

    # Extract article body
    article_str <- resj$parse$text$`*`

    # Get first link
    first_link <- get_first_link(article_str)
    # Replace underscores with spaces
    first_link <- gsub(pattern = "_", replacement = " ", first_link)
    # And decode URL-encoded special characters (e.g. %E2%80%93)
    first_link <- URLdecode(first_link)

    if (first_link %in% links_vec) {
      message("Loop detected for '", starting_page, "' with '",
              first_link, "': exiting loop")
      # Store results before exiting
      links_vec <- c(links_vec, first_link)
      break
    } else if (first_link %in% unique_links) {
      message("Link ", first_link, " already visited: exiting loop")
      # Store results before exiting
      links_vec <- c(links_vec, first_link)
      break
    } else {
      message("First link: ", first_link)
      # Store results
      links_vec <- c(links_vec, first_link)
      # Update search page
      page <- first_link
    }
  }

  # Add the article's chain to the list of chains
  all_links[[i]] <- links_vec

  # Get new links from the last chain
  new_links <- links_vec[1:(length(links_vec)-1)]
  new_links <- new_links[which(!(new_links %in% unique_links))]

  # Update unique links
  unique_links <- c(unique_links, new_links)
}

# Save results
saveRDS(all_links, file = file.path(basepath, "data", "links.rds"))
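As an aside, the two clean-up steps applied to each link (underscore replacement and URL decoding) can be checked on a hypothetical title; the name below is made up for illustration:

```r
# Hypothetical raw title, as it could appear in a href attribute
raw_title <- "Boulevard_Saint%2DMichel"

title <- gsub("_", " ", raw_title)  # underscores become spaces
title <- URLdecode(title)           # percent-encoding is decoded ("%2D" is "-")
title
# "Boulevard Saint-Michel"
```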
Ultimately, this code produces a list of chains of links from article to article, where each chain stops when a loop has been detected.
# See the first 3 link chains
head(all_links, 3)
[[1]]
[1] "English football league system" "League system"
[3] "Hierarchy" "Ancient Greek language"
[5] "Greek language" "Indo-European language"
[7] "Language family" "Language"
[9] "Communication" "Information"
[11] "Abstraction" "Rule of inference"
[13] "Premise" "Proposition"
[15] "Meaning (philosophy)" "Philosophy of language"
[17] "Philosophy" "Existence"
[19] "Reality" "Everything"
[21] "Antithesis" "Proposition"
[[2]]
[1] "Stânceni" "Mureș County" "Romania" "Southeast Europe"
[5] "Sub-region" "Region" "Geography" "Ancient Greek"
[9] "Greek language"
[[3]]
[1] "Guido de Bres Christian High School" "Hamilton, Ontario"
[3] "Provinces and territories of Canada" "Canada"
[5] "North America" "Continent"
[7] "Convention (norm)" "Social norm"
[9] "Acceptance" "Psychology"
[11] "Mind" "Thought"
[13] "Cognition" "Knowledge"
[15] "Declarative knowledge" "Awareness"
[17] "Philosophy"
Results
And now, let’s get to the part we’ve all been waiting for: do all articles really lead to “Philosophy”?
First, I format the results to a network object using the igraph package.
Code
# Format output as an edge list
nk_list <- lapply(all_links, function(l) {
  cbind(l[1:(length(l)-1)], l[2:length(l)])})
nk <- do.call("rbind", nk_list)

# Create graph
g <- igraph::graph_from_edgelist(nk, directed = TRUE)
The next step is to reconstruct the full chain of links for each starting article. Because of the way I coded the traversal, some chains stop before reaching their loop (when the loop was already explored from another chain), so I reconstruct the missing loops in the code below.
Code
# Get all starting articles
starting_nodes <- sapply(all_links, function(l) l[1])

# Initialize list
complete_paths <- vector(mode = "list",
                         length = length(starting_nodes))

for (i in seq_along(starting_nodes)) {
  # Get all simple paths (excluding loops)
  simple_paths <- all_simple_paths(g, from = starting_nodes[i],
                                   mode = "out")
  # Get the longest one
  longest_path_ind <- which.max(sapply(simple_paths, length))
  longest_path <- simple_paths[[longest_path_ind]]

  # Append the vertex the last article points to,
  # to show where the loop closes
  last_vertex <- longest_path[length(longest_path)]
  loop_vertex <- neighbors(g, last_vertex)

  # Create final path
  longest_path <- c(longest_path, loop_vertex)
  longest_path <- longest_path$name

  complete_paths[[i]] <- longest_path
}
In our first exploration of the results, let’s investigate the length of the link chains. This tells us how many articles are visited before entering a loop.
Code
# Get chain length before entering the loop
before_loop <- lapply(complete_paths,
                      function(l) {
                        dup <- which(l == l[duplicated(l)])
                        l[1:min(dup)]
                      })

link_length <- sapply(before_loop, length)
mean_length <- mean(link_length)

ggplot() +
  geom_histogram(aes(x = link_length),
                 binwidth = 1,
                 fill = "#8FC4BD") +
  geom_vline(aes(xintercept = mean_length),
             linetype = "dashed", color = "grey50") +
  ylab("Count") +
  xlab("Article chain length") +
  scale_y_continuous(expand = expansion(mult = c(0, .1))) +
  theme_bw() +
  theme(plot.background = element_rect(fill = "transparent", color = NA),
        panel.background = element_rect(fill = "transparent",
                                        color = "grey50", linewidth = 1),
        panel.grid = element_blank(),
        text = element_text(color = "grey70"),
        axis.text = element_text(color = "grey70"),
        axis.ticks = element_line(color = "grey70"))
Here, the mean chain length is 17.14 links, and no chain exceeds 31.
And for the long-awaited result: Where do articles end up?
Code
# Get the ending link for each article
list_length <- sapply(complete_paths, length)
end_links <- sapply(seq_along(complete_paths),
                    function(i) complete_paths[[i]][list_length[i]])

end_links_df <- data.frame(article = names(table(end_links)),
                           prop = as.numeric(table(end_links))/n_articles*100)
end_links_df$article[end_links_df$article == "List of ecumenical patriarchs of Constantinople"] <- "List of ecumenical patriarchs\nof Constantinople"

end_links_df[end_links_df$article == "Philosophy", ]
article prop
6 Philosophy 32
Code
ggplot(end_links_df, aes(x = prop,
                         y = reorder(article, prop))) +
  geom_col(fill = "#8FC4BD") +
  geom_text(aes(label = paste0(prop, "%")),
            color = "grey70",
            hjust = 0, nudge_x = 0.8) +
  xlab("Proportion (%)") +
  theme_void() +
  scale_x_continuous(expand = expansion(mult = c(0, .1))) +
  theme(axis.title.y = element_blank()) +
  theme(plot.background = element_rect(fill = "transparent", color = NA),
        text = element_text(color = "grey70"),
        axis.text.y = element_text(color = "grey70"),
        axis.text.x = element_blank(),
        axis.ticks = element_blank())
According to this graph, 32% of the articles in my subset end up on Philosophy: that is, they reach a loop on the “Philosophy” article (which eventually loops back to itself). But most articles end up on “Meaning (philosophy)”, and a sizeable portion also end up on “Proposition”. So, was this Philosophy thing a bit of an oversell?
The graph below tells the full story: in fact, a vast majority of articles end up in a loop containing “Philosophy”. Blue dots show starting articles, and red dots ending articles. You can hover over an article to see its name.
Code
# Prepare data for the plot ---
# Set node type (start, end or none)
vertices <- V(g)$name
ind_start <- match(starting_nodes, vertices)
ind_end <- match(end_links, vertices)

type <- rep("none", length(V(g)))
type[ind_start] <- "start"
type[ind_end] <- "end"

V(g)$type <- type

# Set degree attribute
V(g)$degree <- degree(g, mode = "in")

# Get mutual edges (to invert curvature)
mutual <- which_mutual(g)
curvature <- ifelse(mutual, 0.5, 0)

# Plot graph ---
lay <- create_layout(g, "stress",
                     bbox = 10)
gg <- ggraph(lay) +
  geom_edge_arc(strength = curvature, color = "grey70") +
  geom_point_interactive(aes(x = x, y = y, size = degree,
                             color = type, tooltip = name),
                         show.legend = FALSE) +
  scale_size(range = c(1, 3)) +
  scale_color_manual(values = c("start" = "cornflowerblue",
                                "none" = "grey70",
                                "end" = "darkred")) +
  theme_void() +
  theme(plot.margin = margin(t = 10, r = 10, b = 10, l = 10),
        plot.background = element_rect(fill = "transparent", color = NA))

girafe(ggobj = gg, bg = "transparent")
# Number of chains containing "Philosophy"
n_philo <- sum(sapply(complete_paths, function(p) "Philosophy" %in% p))
n_philo
[1] 96
In my small sample, 96% of articles reach the “Philosophy” loop (Wikipedia itself reported 97% in 2016). This loop is reached through three main paths; upon closer inspection of the articles involved, I’ll call them the “Psychology” branch (left), the “Logic” branch (top) and the “Science” branch (right).
Conclusion
This was a fun project that taught me how to interact with Wikimedia’s API and refreshed my network-analysis skills.
But it also prompts other questions: for instance, what happens in other languages? Can we cluster articles by theme? And how do things change when we consider all links instead of only the first? Hopefully, I’ll be able to explore these in a future blog post!
Source code
You can view the source code for this blog post (embedded in a Quarto document). Check it out if you’d like a peek behind the scenes!
The API is undergoing important changes in 2026, so apologies if any links in this post no longer work by the time you read this ↩︎