AI (Artificial Intelligence) will make up for the shortage of data scientists, and the next frontier of “Data Science” as a discipline will be redefined by the advent of the “virtual data scientist”, or “personal analytics assistant”, across almost every industry sector and functional role.
Well, this is not a prophecy, but a profound realization after having met with 100+ start-ups in the last few weeks (my pre-new-year food-for-thought escapades).
To start with, here are a few path-breaking efforts already in motion:
- The Defense Advanced Research Projects Agency (DARPA) is investing heavily in Data-Driven Discovery of Models (D3M) to help non-experts bridge the “data-science expertise gap” by allowing artificial assistants to help people with machine learning. DARPA calls it a “virtual data scientist” assistant.
- Hyper Anna, a Sydney-based start-up, is all geared up to provide virtual data scientist assistants (business intelligence and on-demand insights based on natural-language requests) to financial services organisations, at the touch of a button.
- Lymbyc (formerly Ma Foi Analytics) intends to become a personal Jarvis for every analytics-driven business ninja. In essence, by leveraging artificial intelligence and its subfields, including NLP, ML, robotics, and knowledge graphs, Lymbyc is attempting to play the role of an analytics assistant at the cross-section of industry type (BFSI, healthcare, pharma, retail, etc.) and functional roles (risk officer, underwriter, claims approver, drug discovery compliance lead, or even corporate functions like COO, CFO, CSO, etc.).
To be honest, when I heard the term “virtual data scientist”, I immediately started thinking about a pool of data scientists who are virtual. However, when this term came up repeatedly during my conversations around RPA and its natural evolution towards “Cognitive RPA”, I began to realize this is something else. Interestingly, the only offering I came across talking about cognitive RPA was Workfusion. However, their interpretation and implementation of the virtual data scientist is, in my view, somewhat narrow: it is all about online learning/reinforcement learning in the form of micro-bots providing self-learning capabilities within the workflows of a business process.
The larger purpose of the virtual data scientist, as it is emerging now, is to address a specific problem: with a shortage of data science talent in the market, is there an alternative way for businesses to scale their analytics requirements efficiently and cost-effectively?
What if Siri were a Data Scientist?
Today, we have grown accustomed to asking questions virtually – of personal assistants like Siri, Cortana, Google Now, Alexa, and a few others that are less common. So far, our interactions with these systems have mostly been about organizing our personal preferences.
The data science fraternity will vehemently argue that these tasks are not what data scientists solve, so I should not trivialize the profession. I agree.
Now, allow me to zoom into a more familiar situation in our professional life – the dreaded weekly COO review call (personally, I report to four bosses, and each one has a different question for me!). Alright, it’s Tuesday evening, the summit room is crowded, the coffee cups are full, and the questions are flowing.
Sales pipeline, revenue forecasting, utilization, PAT (profit after tax), and investment asks are high on the agenda, yet these simple and straight questions often take a long time to reach a satisfactory answer. Why? Because even a simple request – getting an answer to the question asked – is a daunting task. Not because we have not realized the value of data, but because our reports are not inherently cognitive by design. In other words, our reports are carefully designed to answer a specific set of known questions.
In review meetings, yes, these reports help us as a starting point, but when subsequent questions around the whys, what-ifs, and so-whats follow, we scamper all around to piece together islands of information at the speed of thought (or rather, the speed of previously unasked questions). In essence, while organisations have realized the value of data, they remain challenged to leverage it.
Now, imagine having a virtual data scientist at your disposal: you push a button on your laptop, or pick up your smartphone, and start asking questions:
- How are we doing on order booking? To answer this question, we will pull out a report showing numbers against a timeline, which can be further sliced and diced by geography, segment, etc. But that is not how human cognition works; the intent behind this question is to look for actionable insights, such as:
– How does it compare to the last two quarters, or to the same time last year? Which offerings are driving this growth or de-growth? Are there any correlations to customer segments or geography? How does it compare to our competitors?
- How are we doing on profitability? To answer this question, we will pull out a report showing percentages against each delivery unit or business unit, which can be further sliced and diced into high-level cost centers, etc. But again, how do you follow the intent, such as:
– What are the top three reasons contributing to a lower PAT? Are there any account-specific observations, or is it all over the place? What levers can we use to improve PAT?
What I am hinting at is a personal data scientist that turns your request into a query, goes across all known data systems for an answer, applies analytics and cognitive learning, and then provides you with an answer – which could be in the form of a visual, a set of references, a set of competitor intelligence, or quite simply a set of numbers with annotations and context highlighted against them.
Wow! This is turning out to be an interesting, utopian solution – but what if you don’t know what to ask? For example, you say “I am interested in my people supply chain” – your personal analytics assistant takes this as a very broad question and starts putting together a dossier, beginning with some basic insights and recommendations for further investigation. A good starting point!
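The request-to-answer flow described above can be sketched in a few lines. This is a minimal, illustrative toy – all data, function names, and the keyword lookup are hypothetical; a real virtual data scientist would replace the lookup with a full NLP parsing layer and query live enterprise systems.

```python
# Toy "known data system": quarterly order bookings in $M (hypothetical numbers).
BOOKINGS = {"Q1": 110.0, "Q2": 120.0, "Q3": 138.0}

# Map question keywords to the metric they refer to.
INTENT_KEYWORDS = {"order booking": "bookings", "bookings": "bookings"}

def parse_intent(question: str) -> str:
    """Turn a natural-language question into a metric name."""
    q = question.lower()
    for phrase, metric in INTENT_KEYWORDS.items():
        if phrase in q:
            return metric
    raise ValueError(f"No known metric in: {question!r}")

def answer(question: str) -> dict:
    """Fetch data, apply simple analytics, and return an annotated answer."""
    metric = parse_intent(question)
    quarters = list(BOOKINGS)
    latest, prior = quarters[-1], quarters[-2]
    growth = (BOOKINGS[latest] - BOOKINGS[prior]) / BOOKINGS[prior]
    return {
        "metric": metric,
        "value": BOOKINGS[latest],
        "annotation": f"{growth:+.1%} vs {prior}",
    }

result = answer("How are we doing on order booking?")
print(result)  # value plus a context annotation, e.g. growth vs the prior quarter
```

The point of the sketch is the shape of the pipeline – intent, fetch, analytics, annotated answer – not the trivial lookup inside it.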
Redefining Data Science
All businesses, regardless of scope or size, deserve access to data scientists to drive value from the data they create and own. The staggering demand for data analysis cannot be met by scaling humans alone; throwing more bodies at mountains of data is not the solution. Hence the proposition that the current data science methodology, and practitioners’ approach to it, needs to be redefined to embrace automation in data science and machine learning.
Well, don’t pull a fast one – data science is not all science, there is a lot of art in it as well; you just can’t automate everything. I agree.
It is well known that data scientists spend a significant amount of time acquiring, cleaning, and organizing data. They also use a number of tools, technologies, and frameworks to gain a holistic view of data before they start applying statistical analysis and machine learning to derive meaningful insights. This is time-consuming and precisely the problem enterprises are grappling with. This prompts me to say that organisations will never experience the full potential of data science if they continue to waste deeply specialized human capital on tasks that can be performed more effectively and efficiently by computers. Hence the argument for introducing automation and democratizing data science, so that organisations can put analytics into the hands of the masses and focus on big-picture insights and solutions.
However, automation in data science and machine learning means a quantum leap. Data science and algorithms can no longer be a black box, and can no longer be the job of a select few. It needs to be intuitive, repeatable, scalable, and iterative in nature, continuously learning without manual intervention. What we are referring to here is “Cognitive Data Science”.
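“Continuously learning without manual intervention” has a concrete, classical form: incremental algorithms that update their state one observation at a time instead of re-fitting offline. As a minimal illustration (my example, not a vendor’s implementation), Welford’s online algorithm maintains a running mean and variance with no batch recomputation:

```python
# Welford's online algorithm: update mean and variance one observation
# at a time, so the statistics never need an offline re-fit.

class OnlineStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # population variance of everything seen so far
        return self.m2 / self.n if self.n else 0.0

stats = OnlineStats()
for x in [2.0, 4.0, 6.0, 8.0]:
    stats.update(x)
print(stats.mean, stats.variance)  # 5.0 5.0
```

The same pattern – small, self-correcting updates on each new data point – is what makes a cognitive pipeline repeatable and scalable rather than a hand-tended black box.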
During my conversations with various start-ups in the AI space, I realized that there are some fundamental issues yet to be addressed beyond the glorification of data science solving real-world problems. For example: “CXOs struggle to get timely responses from different business units and want to avoid duplicating analytics initiatives across the enterprise.” The answer is not to centralize all your enterprise data in one place, hire more data analysts, or establish a state-of-the-art analytics platform. The answer is a virtual data scientist, which can operate across departmental silos on any kind of data and can be accessed by every individual within the organisation.
Anatomy of the Virtual Data Scientist
At a broad level, the virtual data scientist is a system trained for use cases at the cross-section of industry type and business function, with deep capabilities consisting of the following:
- The Interface: There is something intuitive, easy to use, and at the same time fascinating about natural-language question-answering interfaces, such as the one used in Siri; they make the whole process simpler. To truly democratize data science in an enterprise, the virtual data scientist should allow every person in an organisation to interact with it; thus the interface is the most critical component and should have a sophisticated NLP framework built in. Further, the virtual data scientist should allow the automatic rendering of insights by embedding its output into other business applications, such as CRM, ERP, or marketing automation systems.
- The Problem-Parser: The most critical aspect of a virtual data scientist is to follow the speed of questions; thus it requires an uncanny ability not only to parse and tag speech or text, but also to simultaneously contextualize the query into a set of sub-queries that can be executed against all sources of data – internal and external. This means the virtual data scientist needs to be equipped with a library of ontologies and taxonomies, going beyond the core English ontology and leveraging multiple levels of industry/sub-industry specifics and functional-role specifics.
- The Expert System: This is the nerve center of the virtual data scientist; it understands the context and initiates multiple actions, including fetching data, initiating algorithm runs, and building visuals. From a flow perspective, the expert system identifies that an activity needs to be initiated for the query at hand, and based on this input, the relevant algorithms and associated processes get invoked.
- The Corpus: This is the heart of the virtual data scientist. Watson enjoys its fame today (albeit in a selective area of expertise) only because of the rich corpus it has access to. What this really means is that your algorithms and outputs are only as good as your data. Thus, one can go deep into a narrow domain, such as drug discovery within pharma, risk profiling within retail banking, or reconciliation in corporate banking, or go as wide as customer on-boarding across all industries! Basically, you need an exhaustive amount of structured and unstructured data about a specific process; the more you have, the better your virtual data scientist’s performance will be.
- The Recommendation Engine: It surfaces contextually relevant recommendations as the user interacts with the system. The objective here is to map the user journey as they interact with the virtual data scientist and use this learning to improve further. If done well, the recommendation engine will eventually become sophisticated enough to pre-empt what the user is going to ask next, and in some way match the speed of thought. (Are we talking about AI agents taking over the job of data scientists? Oops!) To achieve this level of sophistication, the current practice of offline learning will not be of much use; instead, online learning and reinforcement learning need to be leveraged heavily. The intent is to learn continuously through a feedback mechanism that takes user input on the outputs presented, and thereby learn user preferences over time.
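The online-learning loop behind the recommendation engine can be sketched with the simplest reinforcement-learning device there is: an epsilon-greedy bandit. This is a toy of my own construction – the insight names and the simulated user are hypothetical – but it shows the loop the text describes: suggest, observe feedback, update preferences, with no offline retraining.

```python
import random

class FollowUpRecommender:
    """Epsilon-greedy bandit over candidate follow-up insights."""

    def __init__(self, insights, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {i: 0 for i in insights}    # times each insight was shown
        self.values = {i: 0.0 for i in insights}  # running mean user feedback

    def suggest(self):
        """Mostly exploit the best-rated insight, occasionally explore."""
        if random.random() < self.epsilon:
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)

    def feedback(self, insight, reward):
        """Incrementally fold user feedback (1 = clicked, 0 = ignored) into the mean."""
        self.counts[insight] += 1
        n = self.counts[insight]
        self.values[insight] += (reward - self.values[insight]) / n

rec = FollowUpRecommender(
    ["trend_vs_last_year", "segment_breakdown", "competitor_view"]
)
random.seed(0)  # make the simulation reproducible
# Simulate a user who clicks segment breakdowns 80% of the time.
for _ in range(500):
    choice = rec.suggest()
    clicked = 1 if (choice == "segment_breakdown" and random.random() < 0.8) else 0
    rec.feedback(choice, clicked)

print(max(rec.values, key=rec.values.get))  # the learned user preference
```

After a few hundred interactions the engine has learned, purely from feedback, which follow-up this user actually wants – exactly the “match the speed of thought” behaviour described above, in miniature.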
All said and done, it is very difficult to follow how we humans think and function. This means it is hard to imagine all the possibilities one would expect the virtual data scientist to solve. However, there is a strong belief that artificial intelligence and its subfields, including NLP, ML, robotics, and knowledge graphs, will become more and more prevalent in the near future. We are already seeing chatbots, personal assistants, and RPA in action, solving problems that are repetitive by nature. What fascinates me is the disruptive thinking that companies such as Lymbyc are bringing to automating data science. Irrespective of whether we see a “depth-focused virtual data scientist” solving a narrow problem area or a “breadth-focused virtual data scientist” solving a general problem area in the coming years, it is certain that we will see increasing usage of embedded AI apps, such as virtual data scientists, in every function of our daily life – personal or professional.