According to scholars in major areas of science, many published results cannot be reproduced, and the problem can be severe. In 2011, Bayer HealthCare reviewed 67 of its projects and found that fewer than 25% could be replicated; about two-thirds of the projects showed major inconsistencies. A more recent investigation, reported in November, found that only half of the studies it examined could be replicated. Other fields, including economics and medicine, have reported similar problems. These striking results have deeply damaged the credibility of scientists.
Many factors contribute to this problem, and big data is one of them. From a statistical point of view, the way scientific discoveries are made has changed greatly in the big data era. The reproducibility crisis is partly driven by invalid statistical analyses of data-driven hypotheses, which is the opposite of how things were traditionally done.
In the traditional scientific method, statisticians and scientists work together. Scientists conduct experiments to collect data, and statisticians then analyse those data. A famous example is the “lady tasting tea”. In the 1920s, among a group of academics, a woman claimed that she could tell by taste whether the tea or the milk had been poured into a cup first. The statistician Ronald Fisher doubted that she could. He hypothesized that she was only guessing, so that, out of eight cups, four prepared milk first and four prepared tea first, the number of cups she identified correctly would follow a probability model called the hypergeometric distribution.
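As a minimal sketch of the hypergeometric model described above, the following Python snippet computes the probability of each possible score under pure guessing. It assumes the classical design in which the lady knows that exactly four cups are milk-first and therefore labels exactly four cups that way; the function name `prob_correct` and its parameters are hypothetical and chosen only for illustration.

```python
from math import comb

def prob_correct(k, cups=8, milk_first=4):
    """P(exactly k of the milk-first cups are identified) under pure guessing.

    Assumes the taster labels exactly `milk_first` cups as milk-first,
    so the count of correct labels is hypergeometric.
    """
    ways_right = comb(milk_first, k)                      # correct picks
    ways_wrong = comb(cups - milk_first, milk_first - k)  # incorrect picks
    return ways_right * ways_wrong / comb(cups, milk_first)

for k in range(5):
    print(k, round(prob_correct(k), 4))

# Getting all four milk-first cups right means all eight cups are right.
p_all_correct = prob_correct(4)  # 1/70, about 0.0143
```

Under these assumptions, a perfect score by guessing has probability 1 in 70, so a perfect score is hard to attribute to luck alone.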
To carry out the experiment, the eight cups were presented to the lady in random order. Surprisingly, she identified all eight correctly, which is strong evidence against Fisher’s hypothesis: the chance of getting every cup right by random guessing is only about 1.4 per cent.
With today’s technology we can collect enormous amounts of data, about 2.5 exabytes a day. The traditional sequence of stating a hypothesis, then gathering data, then analysing the data is rare in the era of big data.
Data can now be collected far faster than science develops, and researchers may not know the right hypothesis when they begin analysing the data. Compared with the lady tasting tea experiment, the order of forming the hypothesis and seeing the data has been reversed. For example, scientists can now collect tens of thousands of gene expression measurements from people, but it is very difficult to decide in advance which genes to include in or exclude from a hypothesis. In such cases it is tempting to form the hypothesis from the data. Such hypotheses may look compelling, but conventional inferences drawn from them are generally invalid.
Problems with Data:
To see the problem in a big data setting, consider 100 ladies tasting tea. Suppose none of them can actually tell the difference, and each simply guesses on all eight cups. There is still about a 75.6 per cent chance that at least one of them guesses every cup correctly by luck (this uses the rounded 1.4 per cent figure; the exact probability of 1/70 gives about 76.3 per cent). A scientist may be surprised to find a lady who identified all the cups correctly.
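The figure above can be checked with a short calculation. The sketch below assumes the 100 ladies guess independently and that each has the same chance of a perfect score; the variable names are hypothetical. It computes the chance that at least one lady gets every cup right, once with the rounded 1.4 per cent value and once with the exact value of 1/70.

```python
n_ladies = 100

# Chance that a single guessing lady gets all eight cups right.
p_rounded = 0.014   # the rounded 1.4 per cent figure
p_exact = 1 / 70    # exact hypergeometric probability

def at_least_one(p, n):
    """P(at least one success in n independent tries with success chance p)."""
    return 1 - (1 - p) ** n

chance_rounded = at_least_one(p_rounded, n_ladies)
chance_exact = at_least_one(p_exact, n_ladies)

print(f"rounded p: {chance_rounded:.3f}")  # about 0.756
print(f"exact p:   {chance_exact:.3f}")    # about 0.763
```

Either way, a lucky perfect score among 100 guessers is the expected outcome rather than a surprise.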
If the experiment is repeated with the same lady, the result will probably not be reproduced, because her first result was due to luck; she cannot really tell whether the tea or the milk went in first. This example illustrates how scientists can stumble on seemingly interesting signals in a dataset. They may formulate hypotheses after seeing these signals, then use the same dataset to draw conclusions, claiming the signals are real. It may take a while before they find that their conclusions cannot be reproduced. The problem is especially common in big data analysis: with so much data, some spurious signals will occur just by chance. Worse, this process also allows a scientist to manipulate the data to produce the most publishable result.
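The selection effect described in this paragraph can be illustrated with a small simulation. This is only a sketch under stated assumptions: every lady is a pure guesser, each study has 100 ladies, and anyone with a perfect score is retested once on a fresh set of eight cups. The names `guesses_all_correct` and `simulate`, the cup labelling, and the number of simulated studies are all hypothetical choices made for the example.

```python
import random

CUPS = range(8)
MILK_FIRST = frozenset(range(4))  # hypothetical labelling: cups 0-3 are milk-first

def guesses_all_correct(rng):
    """One guessing lady labels four random cups as milk-first.

    Returns True only if her four picks are exactly the milk-first cups.
    """
    return frozenset(rng.sample(CUPS, 4)) == MILK_FIRST

def simulate(n_studies=2000, n_ladies=100, seed=0):
    """Return (share of studies with a perfect scorer,
               share of perfect scorers who repeat it on retest)."""
    rng = random.Random(seed)
    studies_with_winner = 0
    winners = 0
    winners_replicated = 0
    for _ in range(n_studies):
        n_win = sum(guesses_all_correct(rng) for _ in range(n_ladies))
        if n_win:
            studies_with_winner += 1
        winners += n_win
        # Retest each "discovered" lady once on new cups.
        winners_replicated += sum(guesses_all_correct(rng) for _ in range(n_win))
    return studies_with_winner / n_studies, winners_replicated / max(winners, 1)

discovery_rate, replication_rate = simulate()
print(f"studies with at least one perfect score: {discovery_rate:.3f}")
print(f"perfect scorers who repeat it on retest: {replication_rate:.3f}")
```

In this setup most simulated studies "discover" a talented lady, yet very few of those ladies repeat the feat, which is the pattern of an irreproducible finding.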
The only way for scientists to obtain reproducible results and avoid these problems is to be more careful. Statisticians need to design new procedures that provide valid inferences. Scientists who want reproducible results from data-driven hypotheses need to account carefully for the data-driven process in their analysis.
Statistics is about the optimal way to extract information from data. By nature it is a field that evolves along with data, and the problems of the big data era are just one example of that evolution. These problems call for the development of statistical techniques that produce scientific discoveries that are both interesting and valid.