Research
Data mining and Knowledge Discovery
Graph-based methods
Probabilistic mixture models
Classification
Feature selection
Aggregated search in graph databases
Medicine
Medico-economic domain
The proposed methods are used in many research projects and collaborations:
During this research stay, we focused
on two research fields: sequential data analysis and machine learning.
In the first part, we proposed a new solution to the problem of sequential data clustering. The proposed framework combines graph-based
and probabilistic methods to assign sequences to clusters when the number of clusters is not specified in advance.
In the near future, we plan to extend the framework to the clustering of semantic web services based on models of their behavior.
In the second part of this research stay, we considered the problem of online clustering, in which data are inserted
over time, and started developing a new approach. Unlike traditional clustering methods, online
approaches process instances as they are added to the data collection,
possibly updating existing clusters, without frequently performing a complete re-clustering.
In this part, we worked on an algorithm that improves the runtime of the originally proposed one.
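The insertion step described above can be illustrated with a minimal sketch. It is not the approach developed during the stay: it swaps in a simple distance-threshold ("leader"-style) rule, and the names `insert_point`, `clusters`, and `radius` are hypothetical. It only shows how a new instance can be absorbed by an existing cluster, or open a new one, without re-clustering the whole collection.

```python
from math import dist


def insert_point(clusters, point, radius):
    """Assign `point` to the nearest cluster within `radius`, or open a new one.

    `clusters` is a list of dicts with keys "centroid" (tuple) and "size" (int).
    Only the receiving cluster is updated; no other cluster is touched.
    Returns the index of the cluster that received the point.
    """
    best_idx, best_d = None, None
    for i, c in enumerate(clusters):
        d = dist(c["centroid"], point)
        if best_d is None or d < best_d:
            best_idx, best_d = i, d

    # No cluster yet, or the nearest one is too far: start a new cluster.
    if best_idx is None or best_d > radius:
        clusters.append({"centroid": tuple(point), "size": 1})
        return len(clusters) - 1

    # Otherwise update the running mean of the receiving cluster.
    c = clusters[best_idx]
    n = c["size"] + 1
    c["centroid"] = tuple(m + (x - m) / n for m, x in zip(c["centroid"], point))
    c["size"] = n
    return best_idx


# Made-up 2-D instances arriving one at a time.
clusters = []
stream = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.0, 5.4)]
labels = [insert_point(clusters, p, radius=1.0) for p in stream]
```

With this threshold, the first two points share one cluster and the last two share another. The result of such a rule depends on the arrival order and on `radius`, which is one reason more elaborate update schemes are worth studying.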
Abstract: Recent years have seen the development of data mining techniques in various application areas, with the purpose of analyzing
large and complex data. The medical field is one of these areas: the available data are numerous and described by various attributes,
either classical (such as patient age and sex) or symbolic (such as medical treatments and diagnoses). Data mining generally includes either descriptive
techniques (which provide an attractive mechanism to automatically find the hidden structure of large data sets) or predictive techniques
(able to unearth hidden knowledge from datasets). In this work, the problem of clustering and prediction of heterogeneous data is tackled by
two proposals. The first is a new clustering approach based on a graph coloring method named b-coloring. An extension
of this approach to incremental clustering has also been developed. It consists in updating clusters as new data are added
to the dataset without having to perform a complete re-clustering. The second proposal concerns sequential data analysis and provides a new
framework for clustering sequential data based on a hybrid model that combines the previous clustering approach with mixture Markov chain models.
This method builds a partition of the sequential dataset into cohesive and easily interpretable clusters, and can also predict
the evolution of the sequences in a given cluster.
Both proposals have then been applied to healthcare data from the PMSI program (the French hospital information system), in order to assist
medical professionals in their decision process. In a first step, the b-coloring clustering algorithm was used to provide
a new typology of hospital stays as an alternative to the DRG (Diagnosis Related Groups) classification. In a second step, we defined a
typology of clinical pathways, which makes it possible to predict likely features of future pathways when a new patient arrives at the clinical
center. The overall framework provides a decision-aid system for assisting medical professionals in the planning and management of clinical processes.
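The idea of predicting how a sequence evolves within one cluster can be illustrated with a small sketch. It assumes a single first-order Markov chain per cluster and estimates it by simple counting. It does not reproduce the hybrid b-coloring and mixture model of the thesis, and the function names and the pathway labels below are made up for illustration.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Estimate first-order transition probabilities from the sequences of one cluster."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for state, followers in counts.items()
    }


def predict_next(transitions, state):
    """Return the most probable successor of `state`, or None if none was observed."""
    followers = transitions.get(state)
    if not followers:
        return None
    return max(followers, key=followers.get)


# Hypothetical cluster of pathways; the step labels are illustrative only.
cluster = [
    ["admission", "surgery", "rehab"],
    ["admission", "surgery", "discharge"],
    ["admission", "imaging", "surgery", "rehab"],
]
model = fit_transitions(cluster)
next_step = predict_next(model, "admission")
```

Here `model` maps each observed step to the empirical distribution of the steps that follow it, and `predict_next` returns the most frequent successor. A mixture model would additionally weight several such chains by the probability that a sequence belongs to each cluster.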
Within my post-doctoral research stay at the "Meme Media Laboratory" of Hokkaido University in Japan (JSPS FY2007-PE07555).
Within the DOMECAD Project
"DOnnées Médico-EConomiques pour l'Aide à la décision Distribuée",
dealing with the problem of data analysis in the French healthcare information system (Ph.D. thesis).
Within the European Project TArcHNA
"Towards Archeological Heritage New Accessibility" based on archeological documents clustering and retrieval.
The main objective is to build a prototype that defines a typology of documents over the whole collection in order to speed up
the browsing task. (Co-supervision of Jérémie Legrand and Mohamed Azzaoui).
