The impact of language choice on github projects

mbostock · on Jan 15, 2012

2009. Note the rise in popularity of JavaScript since then: https://github.com/languages

Also, box plots would provide more useful comparison than medians; box plots can show several quantiles simultaneously. This is almost essential if you don't know anything else about the distribution. Still enjoyed the analysis, though, and I think it's an interesting space for further exploration!
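A minimal sketch of the point about quantiles, using made-up commit counts (the two samples and the helper name `five_number_summary` are assumptions for illustration, not data or code from the article). Both samples share a median, yet the five numbers a box plot draws differ sharply:

```python
import statistics

# Hypothetical commits-per-project samples; NOT taken from the article.
lang_a = [1, 2, 2, 3, 3, 3, 4, 4, 5]
lang_b = [1, 1, 1, 1, 3, 20, 40, 80, 200]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot displays."""
    # method="inclusive" treats the data as the whole population of interest.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

for name, sample in (("lang_a", lang_a), ("lang_b", lang_b)):
    print(name, statistics.median(sample), five_number_summary(sample))
```

Reporting only the median would make these two samples look identical; the quartiles and extremes are what separate them.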

aveeno · on Jan 15, 2012

"it becomes harder and harder to contribute to a Perl codebase, the bigger it gets."

I've never written any perl, but is this really true? I fail to see how it's really any worse than ruby/python.

jleader · on Jan 15, 2012

Wait, are there programming languages where as a codebase grows, it doesn't get more complex?

I've worked on reasonably large codebases in Perl, C, C++, Java, and Modula-II, and I'd say that in all cases, as the codebase grew, it got more complex and harder to contribute to.

reactor · on Jan 15, 2012

"I've never written any perl" that says it all. Try it, you will understand. B/w what makes you think ruby/python is worse?

pyre · on Jan 15, 2012

  > "I've never written any perl" that says it all. Try it,
  > you will understand.

Let's not start a language war here, please.

  > I fail to see how it's really any worse than ruby/python.

  > B/w what makes you think ruby/python is worse?

I fail to see where the parent post claimed that ruby/python were worse than Perl.

My interpretation is that the parent poster believes that Perl, Ruby, and Python all suffer from the idea that the larger the codebase gets, the more complex it gets. That complexity then becomes a barrier to entry for contribution to the project.

szabgab · on Jan 15, 2012

... and Java and C and C# and even PHP get more complex as the code base grows. I am sure you don't pretend that naming one, or three, does not send the message that the others do not.

pyre · on Jan 15, 2012

I only named those three because they were within the context of the discussion. Nothing more. The idea that complexity increases over time and becomes a barrier to entry applies to a great many things.

zdw · on Jan 15, 2012

Perl and ruby are pretty similar (much of ruby was inspired by perl according to Matz), but perl is much older than ruby, and thus many people are used to language idioms that aren't particularly modern.

There's also a tradition of writing one-liners à la sed/awk in perl, and going from that to more complete programs generally isn't the best for program structure.

Python's whitespace requirements tend to make it clearer to read, as there are fewer ways to format identical code, and it's much less of a TMTOWTDI language than perl.

tl/dr: Perl code is often written by old timers who expect everyone to be a perl genius like they are.

chromatic · on Jan 15, 2012

> Python's whitespace requirements tend to make it clearer to read...

What projects have you worked on such that consistency of indentation was at all a meaningful factor to maintenance? I worry more about duplication and near duplication, testing efficacy, proper factoring, symbol naming, effective error handling, the possibility of fencepost errors, and clarity of intent.

kamaal · on Jan 15, 2012

Very well put.

In my experience, unless you are hiring exceptionally bad programmers, you don't really have to worry about code indentation.

Seriously, code indentation is all people worry about? I have bigger problems to worry about.

pyre · on Jan 15, 2012

  > tl/dr: Perl code is often written by old timers
  > who expect everyone to be a perl genius like they are.

While I can agree with this sentiment, it seems to come close to the idea that you should ignore language features in favor of making it newbie friendly. You don't want to ignore features of a language just because it might not be entirely clear to someone that is new to the language.

That said, there is a lot of Perl code out there that was written in the past, or recently by people using old idioms/styles. I've even come across people (recently) expressing their dislike of Perl based on their experience with Perl 4.x (hint: Perl 5 was released in 1994).

sirclueless · on Jan 15, 2012

> You don't want to ignore features of a language just because it might not be entirely clear to someone that is new to the language.

I think this has a lot to do with why perl projects tend to accumulate complexity over their lifetimes. It's not an easy decision to ignore language features, and it's a decision that gets made each time a programmer writes more code, so it's inevitable that in a large program in a language with as many deep features as perl has, some programmer will choose to use each of them. The abundance of advanced features in use in a large project then forms a barrier to entry for a newbie who must understand each of them as he encounters them; while each feature and decision may make the most sense for its context, the preponderance of features means it takes a long time for new contributors to get up to speed.

As to why this is different from Ruby/Python or many other more recent languages: Most popular recent languages are extremely resistant to the addition of any language features. They try to have canonical features that cover many use cases, for example Python's iterators and generators, or Ruby's block syntax. By having only a few options open for developers, new contributors are more likely to immediately grok the code of even large projects. Consider the statement "There should be one-- and preferably only one --obvious way to do it" from the Zen of Python[1], which is a gentle stab at the Perl motto, "There is more than one way to do it"[2].

[1] http://www.python.org/dev/peps/pep-0020/
[2] http://c2.com/cgi/wiki?ThereIsMoreThanOneWayToDoIt

chromatic · on Jan 15, 2012

> ... the preponderance of features means it takes a long time for new contributors to get up to speed.

What projects have you worked on where language features are a greater barrier to effective contribution than a deep understanding of the problem domain, adherence to project coding standards, and design principles of the code base?

You can look up language features in the manual.

ccashell · on Jan 17, 2012

I found the statistics to be quite interesting. The conclusions drawn by the author, however, left a lot to be desired. Most of them were not directly (or even indirectly) supported by the data, and consisted primarily of random opinions that left me feeling negatively about an otherwise interesting post.

I would have also liked to see data on the mean for some of the statistics, and not just the median. In some cases, one or the other can provide very misleading statistics, and providing both to compare would have helped smooth over concerns there.

flatline · on Jan 15, 2012

I found the median commits and committers to be interesting - there must be a lot of people doing a small number of commits, which is what I would expect more in e.g. forum posts than source code. I would like to have seen the mean too. As to the title, I don't think that the numbers really say anything about the impact of language choice.

shawnps · on Jan 15, 2012

> First, the sample size. Clearly, github is very popular with the Ruby crowd, with more than four times as many projects as Python, the runner-up.

It looks like Python is at ~400 in the graph, and Ruby is at ~1200. Am I missing something?

chrismealy · on Jan 15, 2012

How much of the javascript on github is just jquery bundled in rails projects?

dpkendal · on Jan 15, 2012

I'm fairly sure that GitHub counts languages by projects, not by files. So a few JavaScript files inside a Ruby project would not affect the ranking.

davorb · on Jan 15, 2012

No Haskell? I'm surprised.

dons · on Jan 15, 2012

While Haskell is ranked ahead of e.g. Scala or Erlang in total repos on github, the majority of larger/active projects in Haskell are still in darcs repos, on places like code.haskell.org.

Historical note: the Haskell dev world jumped on distributed vcs a few years before git appeared, via darcs, and that early move has resulted in the situation today, where it's mostly only the younger projects in git.

dons · on Jan 15, 2012

And just as an example, I have around 200 darcs repos, including big projects like xmonad, yi or bytestring, that might never be on github, since they're in darcs on c.h.o.

The Haskell world is really an anomaly in this regard.

gnaritas · on Jan 15, 2012

Not really, same applies to Smalltalk, and I'd guess Lisp as well.

nknight · on Jan 15, 2012

"Median" and "average" are being used inconsistently in this article, making it more difficult to meaningfully interpret the dataset.

cortesi · on Jan 15, 2012

Hi there. I think you're interpreting "average" as being equivalent to "mean", but this isn't my understanding of the term. Both the "median" and "mean" are measurements of "averageness". I use "median" when I'm talking directly about figures or a diagram. I use "average" when I'm making some general point, and only after I've already specified what my exact measure of central tendency was.

http://en.wikipedia.org/wiki/Average

spullara · on Jan 15, 2012

Median is an exact term that means the middle value in a sorted list of values.
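As a rough sketch of that definition (the function below is illustrative; averaging the two middle values for an even count is the usual convention, not something stated in this comment):

```python
def median(values):
    """Middle value of the sorted data.

    With an even number of values there is no single middle element, so
    this sketch averages the two middle values (the common convention).
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2
```

Python's standard library offers the same behaviour as `statistics.median`.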

cortesi · on Jan 15, 2012

Yes, I don't think anyone has disputed that.

nknight · on Jan 15, 2012

I've never heard anyone use "median" and "average" interchangeably who wasn't confused about what "median" and "mean" mean, and every math class I've ever taken involved a reminder that "median" and "average" are different.

Dig up arcane usages to support your confusing wording all you want, but the point of language is communication, and when you use words in significantly different ways from other people, you're communicating unclearly.

cortesi · on Jan 15, 2012

Then your maths professors were more lax than mine, who were always careful to distinguish between the general term "average" and specific terms like "mean" and "median".

As for the accusation that I'm being arcane - all I can do is point you to that most un-arcane and accessible of all reference works, Wikipedia. It spells it out quite clearly, and agrees with me entirely.

pyre · on Jan 15, 2012

1. In general usage, when people say 'average' they almost always mean the mean, not the median.

2. Pointing out that something is on Wikipedia does not prove how un-arcane it is. I'm sure there are plenty of topics on Wikipedia that are only understood by a small group of people. Wikipedia doesn't require that all pages be common knowledge to the general population before being accepted.

cortesi · on Jan 15, 2012

People do often use the word "average" when they mean "mean" - but they also often use it when they mean "median". It depends on the context. For instance, when I say "the average family has an income of X" or "the average man is Y cm tall", the concept that most reasonably applies is the median, not the mean.

The simple fact is that "average" is a general term which is only used correctly once you've clarified what your exact measure is.

elemeno · on Jan 15, 2012

No, those clearly refer to the mean not the median. People do not ask the question "Given the range of heights in the population, what is the middle value", they want to know the mean height of the population.

This is especially easy to see when people talk about "average income", since it's clearly skewed by there being a long tail on the high-income side which pushes up the average salary. There's a reason why income is generally bracketed instead, and split into quintiles.

cortesi · on Jan 15, 2012

Look more carefully at my wording - most of the time, when you hear a news report using wording like "the income of the average family..." (as opposed to "average income"), you're hearing a median value, not a mean. The newsreader probably doesn't know it, but this is precisely because the median is less skewed by that long tail you mention.
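To illustrate the long-tail effect, here is a small sketch with invented household incomes (the figures are hypothetical, chosen only to show the skew):

```python
import statistics

# Hypothetical household incomes in thousands; the last value is the long tail.
incomes = [20, 25, 30, 35, 40, 45, 50, 55, 1000]

mean_income = statistics.mean(incomes)      # pulled upward by the single outlier
median_income = statistics.median(incomes)  # middle value, barely affected

# The same data without the outlier, for comparison.
without_tail = incomes[:-1]
mean_without = statistics.mean(without_tail)
median_without = statistics.median(without_tail)

print(f"with tail:    mean={mean_income:.1f}, median={median_income}")
print(f"without tail: mean={mean_without}, median={median_without}")
```

Dropping the one extreme household moves the mean from roughly 144 to 37.5, while the median only moves from 40 to 37.5.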