On data value, open source telemetry, Kite, Microsoft and .NET Core

 

It’s now pretty widely accepted that Microsoft has open source religion. The company has embraced open source as a better way of working. But on the data side, it’s actually going further than most open source communities expect or demand. Check out this statement from “What we’ve learned from .NET Core SDK Telemetry”:

“As an open source application platform that collects usage data via an SDK, it is important that all developers that work on the project have access to usage data in order to fully participate in and understand design choices and propose product changes. This is now the case with .NET Core.”

Unambiguous and rather lovely. While code should be open source, it’s generally accepted and acceptable that data about how a service is used is proprietary.

But what about when good code and good intentions get acquired and productised? Today Kite, a startup in the open source telemetry space, was in the news, and not in a good way: “How a VC-funded company is undermining the open-source community.”

“Although Kite has no business model yet, it’s widely thought in Silicon Valley that having users is the first step toward profitability. Adding users potentially benefits the company in another way, by giving it access to precious data. Kite says it uses machine learning tactics to make the best coding helper tools possible. In order to do that, it needs tons of data to learn from. The more code it can look at, the better its autocomplete suggestions will get, for example.”

Given that RedMonk encourages vendors, projects and communities to make better use of their data, build businesses that take advantage of it, to use data as a moat, and to (hopefully) share it with us to better analyse developer choices and trends, these issues are far from academic. Good data governance is super important to us. We strongly err on the side of only using anonymised data based on aggregate behaviours.

The core issue with the Kite situation is arguably about good faith, and setting expectations properly. Users, and this includes software developers, generally don’t mind sharing data that might be used to give a service a proprietary advantage, as long as they

a. know about it up front

b. benefit from it.

The idea that data will only be used for the specific purpose for which it was collected is enshrined in European data protection law, though it has rarely been prosecuted accordingly. In a particularly flagrant example of disregard for this concept, London’s Royal Free Hospital shared patient data with Google’s DeepMind subsidiary without asking patients for permission. Unilateral changes in how data is used are generally not cool. But in open source it is a grey area. While Microsoft is showing leadership by declaring that the data it collects will be share-alike, this is, as stated above, not standard practice.

Microsoft states:

The SDK collects the following pieces of data:
The command being used (for example, build, restore).
The ExitCode of the command.
The test runner being used, for test projects.
The timestamp of invocation.
Whether runtime IDs are present in the runtimes node.
The CLI version being used.
Operating system version.
The data collected does not contain personal information. [italics mine]
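To make the list concrete, here is a minimal sketch of how a record with those fields might be modelled and then summarised only in aggregate, in the spirit of the anonymised approach described above. The class name, function names and sample values are all hypothetical illustrations, not Microsoft's actual schema or tooling.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class SdkTelemetryEvent:
    """Hypothetical record mirroring the fields Microsoft lists (not the real schema)."""
    command: str                 # the command being used, e.g. "build", "restore"
    exit_code: int               # the ExitCode of the command
    test_runner: Optional[str]   # only set for test projects
    timestamp: datetime          # time of invocation
    runtime_ids_present: bool    # whether runtime IDs are in the runtimes node
    cli_version: str
    os_version: str


def command_counts(events: Iterable[SdkTelemetryEvent]) -> Counter:
    """Count invocations per command; the record has no per-user field to expose."""
    return Counter(event.command for event in events)


def failure_rate(events: Iterable[SdkTelemetryEvent], command: str) -> float:
    """Fraction of invocations of `command` that exited non-zero (0.0 if never seen)."""
    exit_codes = [e.exit_code for e in events if e.command == command]
    if not exit_codes:
        return 0.0
    return sum(1 for code in exit_codes if code != 0) / len(exit_codes)


# Made-up sample values, purely for illustration.
_now = datetime(2017, 7, 1, tzinfo=timezone.utc)
sample = [
    SdkTelemetryEvent("build", 0, None, _now, False, "1.0.4", "Ubuntu 16.04"),
    SdkTelemetryEvent("build", 1, None, _now, False, "1.0.4", "Windows 10"),
    SdkTelemetryEvent("restore", 0, None, _now, True, "1.0.4", "macOS 10.12"),
]
print(command_counts(sample))
print(failure_rate(sample, "build"))
```

Nothing in the record identifies a person, which is what makes sharing the aggregate counts with the whole community plausible in the first place.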

To talk about the insights a bit, I found a couple of things notable:

“Our approach to supporting Linux (one build per distro) isn’t providing broad enough support — .NET Core was used on high 10s of Linux distros yet it only works well on 10-20 distros.”

That kind of surprised me. Dealing with combinatorials like this seems like a major hassle. Quite honestly, doing a good job on 10-20 distros would seem to be “good enough”.

“There are gaps in the data that limit our understanding — we would like to know if the SDK is running in a container, for example.”

Yep – insight into containers is definitely non-optional for an SDK to provide useful telemetry in 2017.

There is plenty more to enjoy in the Microsoft data. In conclusion, I would say that it’s great to see Microsoft setting the data governance bar high. This stuff matters. Kite, meanwhile, clearly has some community issues it needs to address if it plans to build a sustainable business model on developer-oriented telemetry. The industry is currently trying to work out how to make open source more sustainable, and do a better job of supporting maintainers. Building businesses that only exploit open source behaviours seems like a retrograde step.

 

Microsoft is a client.

(Read this and other great posts  @ RedMonk)
