Categories
Uncategorized

Extension of Youtube Music Graph study

Extended analysis of YouTube Music recommendation algorithm

In my previous post, I discussed how the YouTube Music recommendation algorithm worked. As we have learned more properties about social networks, I will extend my analysis of the graph.

As a reminder, in my previous post I analyzed the recommendation algorithm of YouTube Music by going through all the songs in my liked songs, getting recommendations for them, and then creating a graph from artist A to artist B if artist A’s song got recommended artist B’s song. This resulted in a directed weighted graph where the weights were the amount of recommendations between the artists. The overall analysis showed that while most songs were recommended similar songs with many of the popular artists being recommended many times, each recommendation had a variety of artists which resulted in many artists being added to the graph.

Before we start this extended analysis some per-processing of the data must be done. As mentioned in the last post, about 3.5% of the nodes are actually disconnected from the main graph. Because most of the new analysis involves strictly connected graphs (balance, triadic closure, page rank, etc…) I will first remove that part of the graph. I will also create two more adjacency matrices:

  1. A undirected signed/ties matrix. Based on the weight distribution attached in the previous post, I have determined that the tail of the power law lies around 5. Therefore, an edge is considered negative if it does not already exist in the graph (meaning there are 0 recommendations between the artists), and positive edges are further divided into weak and strong, based on if the weight is greater than 5. Note that because the original matrix is directed, the value at each node is the truer/maximum of both directions.
  2. A directed normalized matrix where the sum of the weights of the outgoing edges for each node is 1. This is the PageRank matrix, normalized based on the outgoing weights.

In this dataset we have a total of 519 artists. For a complete graph, this would give us 23,165,219 total triangles. Given the sheer magnitude of that, I have decided to work only on the liked artists (the ones who were in my liked songs and used as the seeds for the recommendations) of which there 105 for triad analysis, giving us 187,460 triangles. Based on the signed/ties matrix, we get 81558 (44%) balanced triangles. There are two ways to get unbalanced triangles

  • No positive edges, which are caused by completely unconnected artists. These are common as shown by the previous post since only 1.75% of possible connections exist in this graph.
  • Two positive edges. These are most commonly found when only one of the nodes is a popular artist, since popular artists most likely can connect to all the nodes, but a normal artists might not be able to connect to the other normal artist.

As we can see, less than half of the triangles are balanced, emphasizing the variety of recommended artists which causes the total number of connections to be small.

We also have only 386 triadic closures, 65% of the triads with 2 strong edges. This tells us that if two artists are greatly recommended a third artist, there is more than 50% chance for both of those artists to be recommended each other. This shows how similar artists are more likely to be recommended each other, as if two artists are greatly recommended the same artist, they are all likely to be in the same genre.

For the page rank algorithm, we first initialize the pages as a column vector with each page having an initial rank of 1/N, and set a scaling factor of 90%. Following the formula of (N^T)p *.9+ .1*sum(p)/N and then normalization, we can iterate to improve the ranks of the pages. For this matrix, 20 steps have been chosen as the number of iterations as after that there seems to be little changes. Inspecting the final iteration, we can see that most pages have a rank really close to 0, giving us an average of 0.0019, even though the maximum final page rank is around 0.11. There are only 48 (9%) of the artists that are above the average, clearly showing the commonly known real world phenomena of musician stardom. We can clearly see how a very small group of musicians can get extremely popular and therefore hold the vast majority of fame (rank) while most artists have no fame at all. To this effect, the most popular artist is The Arctic Monkeys holding 11% of the rank by themselves, equal to the 393(76%) bottom individuals combined.  Comparing this to the leading eigenvector, we can see that there are less nodes at 0, and more higher value nodes. However, this does not include the scaling factor, which may have contributed to boosting the highest ranked pages.

This new analysis emphasizes the discoveries made in the previous post about the YouTube algorithm. But it also unexpectedly provides a good mirror for the inner workings of musician stardom, and how it is governed by power laws. The tradic closure property showed us how a popular artist can increase connections between similar artists, acting as a beacon. The property of balance showed us how even though we had a sparse graph connection wise, we were still able to get almost half of the graph balanced, mainly for the fact that two popular artists in the same triad can greatly increase the likelihood of balance since popular artists have many (positive) edges to other nodes. And finally page rank truly shows how fame and stardom are distributed between musicians, namely in a power law distribution as seen by the log-log graph. The heavily tail property of power laws perfectly describes how a few artists can have more fame than the vast majority of artists combined.

Again, I have tried to upload the relevant material here but I keep getting errors so I will post it in the piazza in note @30.

Leave a Reply