Human activities produce massive amounts of raw data presented in the form of tables. In order to understand these tables quickly, EURECOM and Orange are developing DAGOBAH, a semantic annotation platform. It aims to develop a generic solution that can optimize AI applications such as personal assistants, and facilitate the management of complex data sets of any company.
On a day-to-day basis, online keyword searches often suffice to make up for our thousands of memory lapses, clear up any doubts we may have or satisfy our curiosity. The results even anticipate our needs by offering more information than we asked for: a singer’s biography, a few song titles, upcoming concert dates etc. But have you ever wondered how the search engine always provides an answer to your questions? In order to display the most relevant results, computer programs must understand the meaning and nuances of data (often in the form of tables) so that they can answer users’ queries. This is one of the key goals of the DAGOBAH platform, created through a partnership between EURECOM and Orange research teams in 2019.
DAGOBAH’s aim is to automatically understand the tabular data produced by humans. Since this type of data lacks the explicit context that running text provides, understanding it depends on the reader’s knowledge. “Humans know how to detect the orientation of a table, the presence of headings or merged rows, relationships between columns etc. Our goal is to teach computers how to make such natural interpretations,” says Raphaël Troncy, a data science researcher at EURECOM.
The art of leveraging encyclopedic knowledge
After identifying a table’s form, DAGOBAH tries to understand its content. Take two columns, for example. The first lists names of directors and the second, film titles. How does DAGOBAH go about interpreting this data set without knowing its nature or content? It performs a semantic annotation, which means that it effectively applies a label to each item in the table. To do so, it must determine the nature of a column’s content (directors’ names etc.) and the relationship between the two columns. In this case: director – directed – film. But an item may mean different things. For example, “Lincoln” refers to a last name, a British or American city, the title of a Steven Spielberg film etc. In short, the platform must resolve any ambiguity about the content of a cell based on the overall context.
To achieve its goal, DAGOBAH searches existing encyclopedic knowledge bases (Wikidata, DBpedia). In these bases, knowledge is often formalized and associated with attributes: “Wes Anderson” is associated with “director.” To process a new table, DAGOBAH looks up each item in these knowledge bases and proposes candidate attributes: “film title”, “city” etc. At this stage, they remain only candidates. Then, for each column, the candidates are grouped together and put to a majority vote. The nature of the column is thus deduced with a varying degree of probability.
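The lookup-and-vote principle described above can be illustrated with a minimal sketch. It is not DAGOBAH's actual code: the small dictionary `TOY_KB` is a hypothetical stand-in for a real knowledge base such as Wikidata or DBpedia, and the function names are invented for the example.

```python
from collections import Counter

# Hypothetical toy lookup standing in for a real knowledge base
# (Wikidata, DBpedia). Each item maps to its candidate types.
TOY_KB = {
    "Lincoln": ["family name", "city", "film"],
    "Jaws": ["film"],
    "The Grand Budapest Hotel": ["film"],
    "Wes Anderson": ["director"],
    "Steven Spielberg": ["director"],
}


def candidate_types(cell):
    """Return the candidate types for a cell (empty if the item is unknown)."""
    return TOY_KB.get(cell, [])


def vote_column_type(cells):
    """Pick the most frequent candidate type in a column by majority vote.

    Returns (type, support), where support is the share of cells that
    proposed the winning type. Returns (None, 0.0) if nothing matched.
    """
    votes = Counter()
    for cell in cells:
        votes.update(candidate_types(cell))
    if not votes:
        return None, 0.0
    best_type, count = votes.most_common(1)[0]
    return best_type, count / len(cells)


column = ["Lincoln", "Jaws", "The Grand Budapest Hotel"]
print(vote_column_type(column))
```

Here “Lincoln” alone is ambiguous, but the other cells in the column tip the vote towards “film”. The support value is only a rough indication of confidence, in the spirit of the “varying degree of probability” mentioned above.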
However, there are limitations to this method when it comes to complex tables. Beyond applications for the general public, industrial data may contain statistics related to business-specific knowledge or highly specialized scientific data that is difficult to identify.
Neural networks to the rescue
To reduce the risk of ambiguity, DAGOBAH uses neural networks and a word embedding technique. The principle: represent a cell’s content in the form of a vector in multidimensional space. Within this space, the vectors of two semantically close words lie near one another. Visually speaking, the directors are grouped together, as are the film titles. Applying this principle to DAGOBAH is based on the assumption that items in the same column must be similar enough to form a coherent whole. “To remove ambiguity between candidates, categories of candidates are grouped together in vector space. The problem is then to select the most relevant group in the context of the given table,” explains Thomas Labbé, a data scientist at Orange. This method becomes more effective than a simple search with a majority vote when there is little information available about the context of a table.
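To give an idea of how vectors can help resolve an ambiguity, here is a simplified sketch. It rests on several assumptions: the two-dimensional vectors in `TOY_EMBEDDINGS` are made up for illustration (real embeddings are learned and have far more dimensions), and the selection rule, picking the candidate closest to the average of the other cells in the column, is a stand-in for the grouping step described in the quote, not the method DAGOBAH actually implements.

```python
import math

# Made-up 2-D vectors, for illustration only. Real embeddings are learned
# from data and typically have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "Jaws (film)": (0.9, 0.1),
    "The Grand Budapest Hotel (film)": (0.8, 0.2),
    "Lincoln (film)": (0.85, 0.15),
    "Lincoln (city)": (0.1, 0.9),
    "Lincoln (family name)": (0.5, 0.5),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(components) / n for components in zip(*vectors))


def disambiguate(candidates, context):
    """Pick the candidate whose vector is closest to the column's centroid.

    `context` lists the already-resolved items from the same column.
    """
    center = centroid([TOY_EMBEDDINGS[item] for item in context])
    return max(candidates, key=lambda c: cosine(TOY_EMBEDDINGS[c], center))


column_context = ["Jaws (film)", "The Grand Budapest Hotel (film)"]
lincoln_candidates = ["Lincoln (city)", "Lincoln (family name)", "Lincoln (film)"]
print(disambiguate(lincoln_candidates, column_context))
```

Because the other cells in the column sit in the “film” region of the space, the film reading of “Lincoln” is the closest candidate, which matches the assumption that items in the same column form a coherent whole.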
However, one of the drawbacks of using deep learning is the lack of visibility about what happens inside the neural network. “We change the hyperparameters, turning them like oven dials to obtain better results. The process is highly empirical and takes a long time since we repeat the experiment over and over again,” explains Raphaël Troncy. The approach is also computationally expensive, so the teams are working on scaling up the process. Here, Orange’s dedicated big data infrastructures are a major asset. Ultimately, the researchers seek to implement an all-purpose approach, designed end to end and generic enough to meet the needs of highly diverse applications.
Towards industrial applications
The semantic interpretation of tables is a goal but not an end in itself. “Working with EURECOM allows us to have almost real-time knowledge about the latest academic advances as well as an informed opinion on the technical approaches we plan to use,” says Yoan Chabot, a researcher in artificial intelligence at Orange. DAGOBAH’s use of encyclopedic data makes it possible to optimize natural-language question-answering engines of the kind used by voice assistants. But the holy grail will be to provide an automatic processing solution for business-specific knowledge in an industrial environment. “Our solution will be able to address the private sector market, not just the public sector, for internal use by companies who produce massive amounts of tabular data,” adds Yoan Chabot.
This will be a major challenge, since industry does not have knowledge graphs to which DAGOBAH may refer. The next step will therefore be to succeed in semantically annotating data sets using knowledge bases in their embryonic stages. To achieve their goals, for the second year in a row the academic and industry partners have committed to take part in an international semantic annotation challenge, a very popular topic in the scientific community. For four months, they will have the opportunity to test their approach in real-life conditions and will compare their results with the rest of the international community in November.
To learn more: DAGOBAH: Make Tabular Data Speak Great Again
Anaïs Culot for I’MTech