Project Details

Aquatic invasion data mining

Paper / KDD 2014
General Info

This paper explores aquatic invasions by integrating shipping network, ecological and environmental data. It is part of the interdisciplinary project "impacts of coupled changes in navigation infrastructure, global trade, climate and policy on ship-borne invasions".


We propose an approach to the aquatic invasion problem that combines computational techniques with multiple data sources, showing how data mining can be applied to complex yet crucial problems for social good.


The unintentional transport of invasive species (i.e., non-native and harmful species that adversely affect habitats and bioregions) through the Global Shipping Network (GSN) causes substantial losses to social welfare (e.g., annual losses due to ship-borne invasions in the Laurentian Great Lakes are estimated to be as high as USD 800 million). Despite this huge negative impact, management of such invasions remains an extremely challenging task, because the problem is perceived as too complex. Numerous difficulties associated with quantitative risk assessments (e.g., inadequate characterizations of invasion processes, lack of crucial data, and large uncertainties associated with available data) have hampered the usefulness of such estimates in supporting the authorities who are battling to manage invasions with limited resources. We present here an approach for solving the problem at hand via creative use of computational techniques and multiple data sources, thus illustrating how data mining can be used to solve crucial yet very complex problems for social good. By modeling implicit species exchanges via a graph that we refer to as the Species Flow Network (SFN), we study large-scale species flow dynamics via a graph clustering approach that decomposes the SFN into clusters of ports and inter-cluster connections. We then exploit this decomposition to discover crucial knowledge on how patterns in the GSN affect aquatic invasions, and illustrate how such knowledge can be used to devise effective and economical invasive species management strategies. By experimenting on actual GSN traffic data for the years 1997-2006, we have discovered crucial knowledge that can significantly aid the management authorities.
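As a rough illustration of the decomposition described above, the following minimal sketch aggregates vessel movements into a weighted port graph, groups the ports into clusters, and then lists the flow that crosses cluster boundaries. Everything specific in it is an assumption made for illustration: the port list and flow values are invented, the function names are hypothetical, and weighted label propagation is used only as a simple stand-in, since the summary here does not name the clustering algorithm used in the paper.

```python
from collections import defaultdict

# Hypothetical vessel movements: (source port, destination port, flow estimate).
# The values are invented for illustration and are not taken from the LMIU data.
TRIPS = [
    ("Singapore", "Hong Kong", 0.9),
    ("Singapore", "Hong Kong", 0.4),  # repeated route, aggregated below
    ("Hong Kong", "Singapore", 0.7),
    ("Hong Kong", "Shanghai", 0.8),
    ("Shanghai", "Singapore", 0.6),
    ("Rotterdam", "Hamburg", 0.9),
    ("Hamburg", "Antwerp", 0.7),
    ("Antwerp", "Rotterdam", 0.8),
    ("Singapore", "Rotterdam", 0.2),  # weak link between the two regions
]


def build_flow_network(trips):
    """Aggregate per-trip estimates into directed, weighted edges."""
    flow = defaultdict(float)
    for src, dst, weight in trips:
        flow[(src, dst)] += weight
    return dict(flow)


def cluster_ports(flow, max_rounds=50):
    """Weighted label propagation (an assumed stand-in clustering method)."""
    # Undirected projection: each port's neighbours and the flow shared with them.
    neighbours = defaultdict(lambda: defaultdict(float))
    for (src, dst), weight in flow.items():
        neighbours[src][dst] += weight
        neighbours[dst][src] += weight

    labels = {port: port for port in neighbours}  # every port starts alone
    for _ in range(max_rounds):
        changed = False
        for port in sorted(neighbours):
            score = defaultdict(float)
            for other, weight in neighbours[port].items():
                score[labels[other]] += weight
            # Strongest label wins; ties go to the alphabetically first label.
            best = max(sorted(score), key=score.get)
            if best != labels[port]:
                labels[port] = best
                changed = True
        if not changed:
            break
    return labels


def inter_cluster_flow(flow, labels):
    """Sum the flow on edges whose endpoints fall in different clusters."""
    between = defaultdict(float)
    for (src, dst), weight in flow.items():
        if labels[src] != labels[dst]:
            between[(labels[src], labels[dst])] += weight
    return dict(between)


flow = build_flow_network(TRIPS)
labels = cluster_ports(flow)
bridges = inter_cluster_flow(flow, labels)
```

On this toy input the three Asian ports and the three European ports end up in separate clusters, and the only inter-cluster connection is the weak Singapore to Rotterdam link. In a management setting, inter-cluster connections of this kind are the natural candidates for targeted inspection, which is the intuition behind studying clusters and the pathways between them separately.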


Fig 1. Species flow between ports corresponding to vessel movements given in the LMIU 2005-2006 dataset. The edges represent the aggregated species flow between ports, where the color intensity is proportional to the magnitude of flow. Approximately 2300 paths with the highest species flow are shown.

Fig 2. The six major clusters of the SFN during 2005-2006. Dot colors correspond to those in Fig. 3, and white dots are not included in any of the six major clusters. Major clusters remain largely unchanged for the duration of 1997-2006, and contain a significant proportion of total species flow between ports.

Fig 3. Evolution of the major clusters during the period 1997-2006. The clusters in the alluvial diagram are ranked by aggregated flow within the cluster. Here, the columns 1997, 1999, 2002 and 2005 represent the major clusters of the SFN generated from LMIU datasets for 1997-1998, 1999-2000, 2002-2003 and 2005-2006, respectively.

Fig 4. Intra-cluster invasion risk with regard to Singapore. Ports are colored by environmental similarity with regard to Singapore (0-6 or white-red for low-high risk). Grey ports are in a different IRN subcluster and have significantly different environmental conditions (for example, ports near estuaries or in very cold places).

Fig 5. High risk inter-cluster pathways with regard to Singapore. Pathways are colored by environmental similarity with regard to Singapore (0-6 or white-red for low-high risk).