Collaboration on Social Media: Analyzing Successful Projects on Sites Like GitHub


http://arxiv.org/abs/1408.6012

The paper, by researchers in Japan, looks for the key drivers of success for projects on social coding sites like GitHub. The team collected data on more than 300,000 projects (both personal and enterprise ones) through GitHub Archive and the GitHub API. The data covers February 2011 to May 2013 and excludes projects with fewer than 30 commits as well as copies of other projects.

The researchers focused on three main factors:
1. Team Structure
2. Social activity with external developers (follows, bookmarks)
3. The developed products

I guess you don’t need an introduction to sites like GitHub, which provide an environment for showcasing developers’ projects, their history, and so on. On those sites you can publish a project developed by a single person or by a team, and a system is in place for bug reports, requests, improvements, etc.

The researchers used social network analysis, and correlation analysis in particular, to discern the relationship between team structure and success. A topic model applied to README files is used to understand the characteristics of successful projects.

To measure the success of a project, the paper defines success indexes: activity (based on update frequency, requests for changes and bugs), popularity and sociality (based on shares, bookmarks and updates from external developers).

The number of commits added to a project, the number of stars it has received (developers give stars to projects that interest them) and the number of pull requests (the number of times external developers submit code for acceptance into the project) are used as the basis for measuring the popularity of projects. These three measures are also used to establish the frequency of project updates, how far projects are followed by external developers, and project growth.

Through this analysis the researchers established that more than 50% of the projects have no stars and more than 70% have no pull requests whatsoever, confirming that most projects have either not been found or not been evaluated by other developers.

Another topic of analysis is the longevity of a project. Since there is no exact point at which a project terminates, the researchers measured the number of months needed to exceed 90% of total commits for each project. The data showed that 20% of the projects die within a month, 40% within the first three months, and around 20% continue well beyond the 20-month mark.
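To make the longevity measure concrete, here is a minimal sketch of how it could be computed for one project. The function name, the input format (a list of commit counts per month, starting with the project's first month) and the choice to count the month in which the 90% threshold is reached are my own assumptions, not details taken from the paper.

```python
def months_to_reach_share(monthly_commits, share=0.9):
    """Return how many months pass before cumulative commits reach `share` of the total.

    `monthly_commits` is a hypothetical input: commit counts per month,
    in chronological order, starting with the project's first month.
    """
    total = sum(monthly_commits)
    if total == 0:
        # A project without commits has no meaningful lifetime.
        return 0
    threshold = share * total
    running = 0
    for month_index, count in enumerate(monthly_commits, start=1):
        running += count
        if running >= threshold:
            return month_index
    return len(monthly_commits)


# A project that does most of its work early "dies" quickly by this measure.
early_burst = [50, 30, 15, 5]
print(months_to_reach_share(early_burst))
```

With this definition, a project whose commits are concentrated in the first weeks gets a lifetime of one month, no matter how long the repository stays online afterwards.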

As far as the teams of developers are concerned, the degree distribution of the network was taken into account. It shows that GitHub has relatively small in-degree values, maybe because there are no famous people committing to projects on GitHub the way they tweet on Twitter.

Another important characteristic for whether a group, and therefore its project, will be successful is the distance between members of the group: well-connected groups exhibit higher activity, popularity and sociality.

An interesting finding relates to reciprocity between developers: successful projects are more likely to have strong connectivity between internal project members.

Furthermore, the analysis cannot shed light on the optimal size of a group of developers, since that depends heavily on the work being performed. But the researchers found that team size affects project quality and the number of commits by each member. It turns out that teams with over 60 members suffer a decrease in efficiency, while from 2 to 60 members the efficiency remains constant. Still, a solo developer is approximately 2 times more efficient than a team with 2-10 members.
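For readers who have not met the term, reciprocity in a follow network is usually the share of directed links that are returned. The sketch below uses that common definition; whether the paper computes it in exactly this way is an assumption on my part, and the function name and the edge-list input are hypothetical.

```python
def reciprocity(follow_edges):
    """Fraction of directed follow links (a, b) for which (b, a) also exists.

    `follow_edges` is a hypothetical input: an iterable of
    (follower, followed) pairs among the members of one team.
    """
    # Drop duplicates and self-follows before counting.
    edges = {(a, b) for a, b in follow_edges if a != b}
    if not edges:
        return 0.0
    mutual = sum(1 for a, b in edges if (b, a) in edges)
    return mutual / len(edges)


# ann and bob follow each other; ann follows cy without being followed back.
team = [("ann", "bob"), ("bob", "ann"), ("ann", "cy")]
print(round(reciprocity(team), 2))
```

A value close to 1 means nearly every follow inside the team is mutual, which is the kind of strong internal connectivity the finding describes.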


Attention is also given to the so-called “Workload Bias”, the idea that one of two things happens in teams:

1. Only a few members work hard, or

2. All members work equally hard.

WorkBias is defined as the entropy of the committed changes over group members. WorkBias is very low when only a few members perform almost all the work, and high when the work is distributed evenly among all members.
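As a rough illustration, the entropy of the commit distribution can be written as H = -sum(p_i * log(p_i)), where p_i is the share of commits made by member i. The sketch below assumes a base-2 logarithm and no normalization by team size; both choices, as well as the function name and the input format, are my assumptions rather than details from the paper.

```python
import math


def work_bias(commits_per_member):
    """Shannon entropy (in bits) of the commit shares of the team members.

    `commits_per_member` is a hypothetical input: one commit count per member.
    """
    total = sum(commits_per_member)
    if total == 0:
        return 0.0
    entropy = 0.0
    for commits in commits_per_member:
        if commits > 0:  # members without commits contribute nothing
            share = commits / total
            entropy -= share * math.log2(share)
    return entropy


# One member doing almost everything versus an evenly split team of four.
print(work_bias([97, 1, 1, 1]) < work_bias([25, 25, 25, 25]))
```

Note that without normalization the maximum value grows with team size (log2 of the number of members), so comparing teams of different sizes would need an extra step such as dividing by that maximum.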

Another important characteristic of successful projects is how their teams respond to requests from outside developers. Projects that deal with pull requests turn out to have a better success rate and relatively higher popularity. Response time does not appear to have significant consequences as long as the requests are dealt with.

Word clouds were built from the projects’ README files to gain a better understanding of the project descriptions. Latent Dirichlet Allocation (LDA) was used to model the word distributions of topics, together with Lasso (a linear regression model with feature selection). Combining LDA and Lasso is meant to show whether a topic contributes positively to a success index.

In conclusion, it can be said that the best teams have fewer than 10 members in order to stay efficient, deal with changes constantly, and have well-connected members, which goes along with higher activity.

Krisi
