Computer science is different from other sciences. Computer scientists usually focus more on algorithms and less on the datasets they are using. In addition, in computer science everything is either false or true. This is rarely the case in other scientific disciplines (Skiena 2017).
Data science is more like the other sciences in this regard. Constructing datasets is usually complicated, and probabilistic approaches play an important role. Data science is also the search for insights based on data, and there are two different approaches to it (Skiena 2017).
The first approach is the hypothesis-driven paradigm. A research paper usually starts with a hypothesis based on the literature. A hypothesis might be that “an increase in the number of vaccinated people reduces the overall number of infections”. The researcher would then search for the relevant data and verify or falsify the hypothesis. The researcher will soon realize that the effect also depends on the season, the number of infected persons, general mobility, and many other variables.
Another researcher might postulate that refugee children are more likely to become entrepreneurs because they are used to managing scarce resources. The researcher might then search for data to verify or falsify the hypothesis. Relevant data could include the biographies of entrepreneurs, the growth and failure rates of their companies, the year each company was founded, and its industry.
This might be a better approach for many research questions, as it reduces the risk of finding relationships that are purely random. However, it makes it easy to miss interesting observations that the hypothesis did not anticipate.
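Checking a hypothesis against data can be sketched with a permutation test, which asks how often a difference between two groups at least as large as the observed one would appear if the group labels were assigned at random. The numbers below are made up purely for illustration, and the function name `permutation_test` and its parameters are assumptions of this sketch, not something taken from the cited literature.

```python
import random

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random and recompute the difference.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Hypothetical data: infections per 1,000 inhabitants in two sets of regions.
low_vaccination = [31, 28, 35, 30, 33, 29]
high_vaccination = [22, 25, 19, 24, 21, 23]

p_value = permutation_test(low_vaccination, high_vaccination)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value only says that the difference is unlikely to be due to chance in this sample. It does not account for the other variables mentioned above, such as season or mobility, which would have to be controlled for separately.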
The second approach is the data-driven paradigm, which starts with a dataset and asks which interesting questions can be addressed with it. This makes sense when there are too many variables to consider. Findings from this approach should always be treated with caution, as some effects might be purely random and not replicable in other contexts. However, these findings can be corroborated with further analyses.
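The warning about purely random effects can be illustrated with a small simulation. The sketch below generates a target variable and many candidate variables that are all independent random noise, then reports the strongest correlation found. The sample sizes, the seed, and the helper `pearson` are arbitrary choices made for this illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
n_observations = 20   # few observations ...
n_variables = 200     # ... but many candidate variables

# Every series is independent noise, so any true relationship is absent.
target = [rng.gauss(0, 1) for _ in range(n_observations)]
candidates = [
    [rng.gauss(0, 1) for _ in range(n_observations)]
    for _ in range(n_variables)
]

correlations = [pearson(candidate, target) for candidate in candidates]
strongest = max(correlations, key=abs)
print(f"strongest correlation among {n_variables} noise variables: {strongest:.2f}")
```

With this many candidates and so few observations, at least one variable will typically show a sizeable correlation with the target even though none is related to it, which is why data-driven findings should be checked on fresh data.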
Two common problems arise in this space: models try either to classify or to predict based on a sample of data. The first common problem is the classification of data points. You might want to know whether an e-mail is spam or which documents share the same features. It can also be a matter of deciding on the genre of a book or movie. The second common problem is predicting an outcome from a set of input variables. This is usually done with regression analysis. For example, an insurance company knows the weight, height and education of its clients and wants to predict their life expectancy.
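Both problem types can be sketched in a few lines. The example below uses a one-nearest-neighbour rule as a stand-in for classification and ordinary least squares with a single input variable as a stand-in for regression. These are only two of many possible techniques, and the feature choices and all numbers are invented for illustration.

```python
def nearest_neighbor(train, point):
    """Classify `point` with the label of the closest training example."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda example: squared_distance(example[0], point))
    return label

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Classification: hypothetical e-mails described by
# (number of links, number of exclamation marks).
emails = [
    ((8, 5), "spam"),
    ((6, 7), "spam"),
    ((0, 1), "not spam"),
    ((1, 0), "not spam"),
]
print(nearest_neighbor(emails, (7, 6)))
print(nearest_neighbor(emails, (0, 0)))

# Regression: hypothetical input variable and outcome.
inputs = [1, 2, 3, 4]
outcomes = [3, 5, 7, 9]
slope, intercept = fit_line(inputs, outcomes)
print(f"prediction for input 5: {slope * 5 + intercept:.1f}")
```

The difference lies in the output: the classifier returns one label from a fixed set, while the regression returns a number on a continuous scale. A realistic version of the insurance example would use several input variables at once.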
The operating model that dominated in the past was the analytical approach. Companies gathered data and used it to improve their operations. For example, hospital managers could use data to identify bottlenecks in the flow of patients. Waiting times are a good indicator of where improvements can be made.
The new data-driven approach changes this. Data is at the core of the business and operating model. For example, Amazon can use the insights from other customers to suggest additional products. Spotify can construct playlists based on the preferences of clients with similar music tastes. These business models would not be possible without access to data.
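The idea of suggesting products based on similar customers can be sketched as follows. This is a deliberately simple illustration that measures similarity with the Jaccard index over purchase sets. It is not a description of how Amazon or Spotify actually build their recommendations, and the customers and products are hypothetical.

```python
def jaccard(a, b):
    """Share of items two sets have in common, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(customer, purchases):
    """Suggest items bought by the most similar other customer."""
    own = purchases[customer]
    others = [name for name in purchases if name != customer]
    most_similar = max(others, key=lambda name: jaccard(own, purchases[name]))
    # Only suggest items the customer does not already have.
    return sorted(purchases[most_similar] - own)

# Hypothetical purchase histories.
purchases = {
    "ana": {"tent", "sleeping bag", "stove"},
    "ben": {"tent", "sleeping bag", "headlamp"},
    "cleo": {"novel", "bookmark"},
}

print(recommend("ana", purchases))
print(recommend("ben", purchases))
```

The sketch makes the dependence on data visible: without the purchase histories of other customers there is nothing to compare against, and no suggestion can be made.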