The Shift From Open Source To Commercial Data Analytics Is Placing Cost Over Accuracy

The era of “big data” has been marked by a cataclysmic break from statistics. With the loss of the denominator across much of modern data science and a growing departure from the idea that the quality of our data influences the accuracy and representativeness of our results, we seem to have entered a “post-statistics” era of big data. One of the key driving forces behind this transition has been the shift from open source tools and open data to opaque datasets processed through black box algorithms that make reproducibility and accuracy assessments impossible. In an era in which we no longer seem to care about the accuracy of our results, what does the future of data science hold in an increasingly proprietary world?

The modern era of data science was once built upon an open and transparent world of open source software and open data, wielded by statistically literate technical experts who deeply understood both the tools and the data they were using. Every algorithm was cited back to its complete description in the academic literature. Every implementation was open source software that could be inspected and improved. A focus on mathematical accuracy over marketing hyperbole meant tools were typically upfront about their biases and limitations, backed by a large published archive of case studies across disciplines and datasets. Implementers themselves often hailed from the sciences, with strong algorithmic and numerical methods backgrounds ensuring a rigorous focus on accuracy and completeness.

In contrast, as data science has become ever more commercialized, that transparency has given way to the opacity of the enterprise world. Tools are closed source, algorithms are proprietary, technical documentation is scarce and implementers are frequently enterprise developers lacking the traditional numerical methods backgrounds and relentless focus on accuracy and completeness that define the world of scientific code.

The scientific codes that once defined the data science space are typically designed for expert use, filled with knobs and dials to adjust every available parameter of the underlying algorithms with absolute precision. All that complexity requires algorithmic, statistical and technical understanding that fewer and fewer data scientists possess. Moreover, our lack of understanding of the commercial datasets that increasingly define the big data era means even analysts with deep statistical backgrounds lack the necessary insights into their data to be able to appropriately tune their algorithms.
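
To make the contrast concrete, the minimal sketch below uses scikit-learn’s KMeans, an open source implementation that exposes its tuning parameters directly; the specific values are illustrative assumptions, not recommendations, and choosing them well requires exactly the kind of algorithmic and statistical understanding described above.

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic stand-in data, for illustration only.
    X = np.random.default_rng(0).normal(size=(1_000, 8))

    # Every knob is exposed: initialization scheme, number of restarts,
    # iteration cap, convergence tolerance and seeding. Sensible choices
    # depend on both the algorithm and the structure of the data.
    model = KMeans(
        n_clusters=5,
        init="k-means++",
        n_init=10,
        max_iter=300,
        tol=1e-4,
        random_state=0,
    )
    labels = model.fit_predict(X)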

The scientific world’s emphasis on correctness means the computational cost of analyses is typically secondary to ensuring the accuracy and completeness of their results.

In contrast, cost reigns supreme in the commercial world, creating strong incentives to adopt damaging optimizations like aggressive sampling and reduced numerical precision, or dangerous implementation shortcuts that can invalidate results. In the deep learning space, many of these tradeoffs are exposed to developers, but those without a background in the underlying mathematics may not fully understand the ramifications of prioritizing speed over accuracy to reduce the size and execution time of their models.
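
As one illustration of why reduced precision is not a free optimization, the minimal sketch below (using NumPy and made-up values) accumulates a running sum in half precision: once the accumulator grows large enough, each small increment falls below its resolution and is silently dropped.

    import numpy as np

    values = np.full(50_000, 0.1, dtype=np.float16)

    # Naive half-precision accumulation: the running sum stops growing once
    # the accumulator's spacing exceeds the size of each increment.
    acc = np.float16(0.0)
    for v in values:
        acc = acc + v

    print(acc)                               # stalls at 256 instead of the true total
    print(values.astype(np.float64).sum())   # full-precision reference, ~4999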

Data analytics is increasingly becoming a turnkey “point and click” affair, where all of the complexity and nuance of the underlying algorithms are hidden from the user. A sentiment analysis might extrapolate its results from a 1% random sample of the data without ever letting on to the analyst that the results are based on anything other than a complete population assessment.
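
A minimal sketch of that scenario, using synthetic scores in place of any real platform’s output: the 1% sample’s average carries sampling error of roughly one hundredth of a standard deviation, a margin the analyst never sees if the sampling itself is hidden.

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical per-document sentiment scores for a 1M-document corpus.
    population = rng.normal(loc=0.02, scale=1.0, size=1_000_000)

    # What a turnkey platform might silently do: score a 1% random sample
    # and report its mean as if it were the population figure.
    sample = rng.choice(population, size=len(population) // 100, replace=False)

    print(f"population mean: {population.mean():+.4f}")
    print(f"1% sample mean:  {sample.mean():+.4f}")
    # The difference between the two lines is pure sampling error.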

Results have become about “good enough” rather than “correct” and “complete.”

To those entering the world of data science from a traditional HPC background, building scientific codes for supercomputers where even the underlying hardware circuitry is known and accounted for, the opaque “trust us,” “good enough” leap of faith of the commercial data science world can be jarring.

Reproducibility is also far more difficult in the enterprise world. Many commercial analytics companies are constantly improving their algorithms, meaning the same analysis run a few days later may yield wildly different results. Even when using the exact same data and parameters, results may not be repeatable, making it impossible to know if the original analysis was incorrect or whether the analytics vendor simply changed their algorithm without notice.
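
One partial defense, sketched below with a hypothetical helper (record_run is not any vendor’s API), is to log a fingerprint of every query’s parameters and results, so that when a later re-run disagrees there is at least evidence of what changed.

    import datetime
    import hashlib
    import json

    def record_run(query_params, results, api_version=None):
        """Store a fingerprint of a vendor query alongside its results."""
        blob = json.dumps(results, sort_keys=True, default=str).encode()
        return {
            "retrieved_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "params": query_params,
            "api_version": api_version,  # vendors rarely expose this; often None
            "result_sha256": hashlib.sha256(blob).hexdigest(),
        }

    # Example: keep this record next to the analysis outputs.
    print(record_run({"query": "brand x", "window": "7d"}, [{"score": 0.42}]))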

Analytics companies with built-in datasets, like social media analysis platforms, often fail to reprocess their historical data when making algorithmic changes. A number of major commercial social analytics platforms routinely make breaking changes to their core algorithms without updating their historical data, resulting in longitudinal analyses whose findings are nearly entirely algorithmic artifacts rather than genuine patterns in the underlying data.
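
A small simulation, with entirely made-up numbers, shows how such an artifact arises: the true daily rate below never changes, but a classifier swap partway through the archive, with no reprocessing of older data, produces what looks like a genuine longitudinal shift.

    import numpy as np

    rng = np.random.default_rng(0)
    days = 365

    # The true daily positive-sentiment rate is flat at 50%.
    true_rate = np.full(days, 0.50)

    # Classifier v1 under-detects positives; v2, deployed on day 200,
    # over-detects. Historical scores are never reprocessed, so the
    # archive silently mixes the two versions.
    noise = rng.normal(0.0, 0.01, days)
    measured = np.where(np.arange(days) < 200,
                        true_rate - 0.08 + noise,
                        true_rate + 0.06 + noise)

    # A longitudinal study of the archive sees an abrupt "shift in mood"
    # on day 200 that is purely an algorithmic artifact.
    print(measured[:200].mean(), measured[200:].mean())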

Closed platforms analyzing closed datasets also make it far too easy for bad science to flourish when their results can never be verified or externally scrutinized for mistakes or malfeasance.

It doesn’t have to be this way.

Some analytics platforms, especially those of the major cloud vendors, differentiate themselves through their focus on the accuracy and completeness of traditional scientific workflows. Many of these platforms are essentially software interfaces to the vendor’s hardware rental business, where the focus is on providing maximally accurate tools for processing the customer’s own data, generating a steady stream of hardware rental revenue. These platforms offer a hybrid between the transparency of the scientific world and the opacity of the commercial world: while their underlying source code may be proprietary, they typically implement well-known algorithms, document their specific implementations in detail and restore full control over all of the algorithm’s configuration options. Some even open source the algorithmic portions of their platforms for maximal transparency, or make the entire toolkit open source and compete on how well it is optimized for their specific cloud offerings.

Putting this all together, as data science matures as a field, we need to far more carefully balance the convenient opacity of turnkey analytics platforms with the more complex transparency of the scientific world. Some analytics platforms have managed to blend these two competing demands quite well, but much of the “big data” world, especially social media analytics, remains in the shadows.
