Natural language processing for analysis of unstructured data

Project Member(s): Piccardi, M.

Funding or Partner Organisation: Capital Markets Cooperative Research Centre - Capital Markets CRC (Capital Markets - CRC Scholarship (External))

Start year: 2016

Summary: The term "unstructured data" is increasingly used in the data science community to refer to all data that are not organised in a predefined manner and/or stored in conventional databases. Such data encompass text documents such as emails, tweets, meeting notes and other common business documents, and also extend to audio, images, video and various types of metadata. Analysing unstructured data has the potential to yield valuable information for the data’s owners; however, it is made difficult by the data’s heterogeneity and by the unstructured way in which they are formatted and collected. This PhD research is funded by the Capital Markets CRC (CMCRC), under its Health Market Quality program, through a scholarship and additional funding. The CMCRC has signed up the Victorian Government’s Transport Accident Commission (TAC) as its industry partner on this project. The PhD student will work with TAC to explore its unstructured data, with a focus on text-based data. In a second stage, he will develop innovative natural language processing (NLP) techniques to extract information from these data that is useful to TAC’s managers and end users and valuable to the research community at large.

Publications:

Seifollahi, S, Bagirov, A, Zare Borzeshi, E & Piccardi, M 2019, 'A simulated annealing‐based maximum‐margin clustering algorithm', Computational Intelligence, vol. 35, pp. 23-41.

Seifollahi, S, Piccardi, M & Zare Borzeshi, E 2017, 'A semi-supervised hidden Markov topic model based on prior knowledge', Data Mining, Australasian Data Mining Conference, Springer, Melbourne, VIC, Australia, pp. 265-276.

Keywords: unstructured data, natural language processing, machine learning

FOR Codes: Computer Software and Services not elsewhere classified, Health not elsewhere classified, Pattern Recognition and Data Mining