
Aggregation

STDistanceBased

The STDistanceBased class aggregates point data using spatio-temporal distance-based clustering, a simple technique that groups points lying within given spatial and temporal thresholds of one another. If the spatial threshold is 30 kilometres and the temporal threshold is 30 days, for example, then any pair of points that are no more than 30 kilometres apart in space and no more than 30 days apart in time will be considered part of the same cluster. Once the clusters are established, the variables of their constituent points are aggregated.
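The grouping rule can be illustrated with a minimal sketch in plain Python. It is not GeoJikuu's implementation: the function name st_cluster and the toy data are hypothetical, and the sketch assumes that links are followed transitively (if A links to B and B links to C, all three share a cluster), which is consistent with the three-point cluster in the example output later on this page.

```python
from itertools import combinations
from math import dist


def st_cluster(points, times, spatial_threshold, temporal_threshold):
    """Return one cluster id per point (hypothetical helper, not GeoJikuu API).

    Two points are linked when they are within BOTH thresholds. Links are
    followed transitively using a small union-find structure.
    """
    parent = list(range(len(points)))

    def find(i):
        # Walk up to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        close_in_space = dist(points[i], points[j]) <= spatial_threshold
        close_in_time = abs(times[i] - times[j]) <= temporal_threshold
        if close_in_space and close_in_time:
            parent[find(i)] = find(j)

    # Renumber the roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]


# Toy data: two spatial groups, with timesteps roughly one year apart.
points = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0), (50.0, 50.0), (50.5, 50.0), (50.0, 50.5)]
times = [0, 365, 731, 1096, 1461, 1826]
labels = st_cluster(points, times, spatial_threshold=2.0, temporal_threshold=365)
print(labels)  # [0, 0, 1, 2, 2, 2]
```

The third point is spatially close to the first two but 366 timesteps after its nearest neighbour, so it ends up in a cluster of its own.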

Importing the Class

The STDistanceBased class is located in GeoJikuu's aggregation.point_aggregators module:

from geojikuu.aggregation.point_aggregators import STDistanceBased

Coordinate Projection

GeoJikuu's aggregation classes assume that any input coordinates have already been projected to a linear coordinate system. In addition, the STDistanceBased class's aggregate() function requires the input distance to be in the same unit as the chosen projection system, so the projection system's unit conversion will be needed as well. For example:

import pandas as pd
from geojikuu.preprocessing.projection import CartesianProjector

cartesian_projector = CartesianProjector("wgs84")

data = {
    "lat": [34.6870676, 34.696109, 34.6525807, 35.7146509, 35.6653623, 35.6856905],
    "lon": [135.5237618, 135.5121774, 135.5059984, 139.7963897, 139.7254906, 139.7514867],
    "date": ["19/03/1990", "19/03/1991", "19/03/1992", "19/03/1993", "19/03/1994", "19/03/1995"],
    "value": [1, 2, 1, 5, 6, 3]
}

df = pd.DataFrame.from_dict(data)

results = cartesian_projector.project(list(zip(df["lat"], df["lon"])))
df["cartesian_coordinates"] = results["cartesian_coordinates"]
unit_conversion = results["unit_conversion"]
df.head()
   lat        lon         date        value  cartesian_coordinates
0  34.687068  135.523762  19/03/1990  1      (-0.5867252094096281, 0.5760951446437298, 0.5690939403658662)
1  34.696109  135.512177  19/03/1991  2      (-0.5865446454704655, 0.5761508222375707, 0.5692236896905267)
2  34.652581  135.505998  19/03/1992  1      (-0.5867908125787922, 0.5765169810552485, 0.5685989032948122)
3  35.714651  139.796390  19/03/1993  5      (-0.6201191587857982, 0.5241083019965597, 0.5837488472665938)
4  35.665362  139.725491  19/03/1994  6      (-0.6198530442409449, 0.5251996831772221, 0.5830501662256676)

For more information, see: Projection Classes
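
To give a feel for what the projection and the unit conversion do, here is a rough sketch that assumes a unit-sphere model with a mean Earth radius of 6371 km. The function names to_unit_sphere and km_to_projected and the radius constant are assumptions made for illustration; CartesianProjector's actual internals may differ, so use the values it returns in real code.

```python
from math import cos, radians, sin

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, used as the unit conversion


def to_unit_sphere(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on a sphere of radius 1."""
    lat, lon = radians(lat_deg), radians(lon_deg)
    return (cos(lat) * cos(lon), cos(lat) * sin(lon), sin(lat))


def km_to_projected(distance_km, unit_conversion=EARTH_RADIUS_KM):
    """Express a distance in kilometres in the projected unit."""
    return distance_km / unit_conversion


# First point of the example data; the result is close to the first
# 'cartesian_coordinates' row shown above.
x, y, z = to_unit_sphere(34.6870676, 135.5237618)

# A 100 km threshold expressed in the projected unit.
threshold = km_to_projected(100)
```

Under these assumptions a distance in kilometres shrinks to a small unitless number, which is why thresholds passed to aggregate() are divided by the unit conversion first.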

Convert Dates to Timesteps

In the case of STDistanceBased, it is also necessary to convert any dates to timesteps before running the aggregate() function:

from geojikuu.preprocessing.conversion_tools import DateConvertor
date_convertor = DateConvertor(date_format_in="%d/%m/%Y", date_format_out="%d/%m/%Y")
# Convert a single date to a timestep (a count of days)
date_convertor.date_to_days(date="10/05/1995")

# Convert every date in the DataFrame
df['date_converted'] = df['date'].apply(date_convertor.date_to_days)
df.head()
   lat        lon         date        value  cartesian_coordinates                                          date_converted
0  34.687068  135.523762  19/03/1990  1      (-0.5867252094096281, 0.5760951446437298, 0.5690939403658662)  726544
1  34.696109  135.512177  19/03/1991  2      (-0.5865446454704655, 0.5761508222375707, 0.5692236896905267)  726909
2  34.652581  135.505998  19/03/1992  1      (-0.5867908125787922, 0.5765169810552485, 0.5685989032948122)  727275
3  35.714651  139.796390  19/03/1993  5      (-0.6201191587857982, 0.5241083019965597, 0.5837488472665938)  727640
4  35.665362  139.725491  19/03/1994  6      (-0.6198530442409449, 0.5251996831772221, 0.5830501662256676)  728005

For more information, see: DateConvertor
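
Because the temporal threshold is compared against differences between timesteps, it can help to check the gaps between dates directly. The sketch below uses only Python's standard datetime module; days_between is a hypothetical helper, not part of GeoJikuu, and the absolute timestep numbers produced by DateConvertor are not reproduced here, only the gaps.

```python
from datetime import datetime


def days_between(date_a, date_b, date_format="%d/%m/%Y"):
    """Number of whole days separating two date strings."""
    a = datetime.strptime(date_a, date_format)
    b = datetime.strptime(date_b, date_format)
    return abs((b - a).days)


gap_1990_1991 = days_between("19/03/1990", "19/03/1991")
gap_1991_1992 = days_between("19/03/1991", "19/03/1992")
print(gap_1990_1991, gap_1991_1992)  # 365 366
```

The second gap includes 29 February 1992, so it is one day longer than a 365-day threshold would allow. The same gaps can be read off the 'date_converted' column above (726909 - 726544 and 727275 - 726909).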

Creating an STDistanceBased Object

An STDistanceBased object is created by passing in a DataFrame, the label of the column that contains the coordinates, and the label of the column that contains the timesteps:

st_distance_based = STDistanceBased(data=df, coordinate_label="cartesian_coordinates", time_label="date_converted")

Aggregating

Once the object has been created, the input data can be aggregated using the aggregate() function. As input, the aggregate() function takes a spatial distance threshold, a temporal distance threshold, and the desired aggregate type (e.g., mean or sum). To convert the spatial distance from kilometres to the projection system's unit, divide it by the unit_conversion variable. The temporal distance threshold is given in days. Here is an example of aggregating the points by mean based on a spatial threshold of 100 kilometres and a temporal threshold of 365 days:

st_distance_based.aggregate(spatial_distance=100/unit_conversion, temporal_distance=365, aggregate_type="mean")
Output: Aggregated 6 points into 3 clusters.
   value     date_converted  midpoint                                                                 count  mbr       temporal_extent
0  1.500000  726726.5        (-0.5866349274400469, 0.5761229834406503, 0.5691588150281965, 726726.5)  2      0.000115  (726544, 726909)
1  1.000000  727275.0        (-0.5867908125787922, 0.5765169810552485, 0.5685989032948122, 727275.0)  1      0.000000  (727275, 727275)
2  4.666667  728005.0        (-0.6199685163109833, 0.5246975627357471, 0.5833791301682658, 728005.0)  3      0.000712  (727640, 728370)

In this example, the aggregation resulted in three clusters. The 'value' column holds the mean 'value' of the points in each cluster, 'date_converted' holds the cluster's temporal midpoint, 'midpoint' holds the spatial midpoint (mean coordinate) followed by the temporal midpoint, 'count' is the number of points in the cluster, 'mbr' is the minimum bounding radius of the circle that encloses all of the cluster's points, and 'temporal_extent' gives the earliest and latest timesteps in the cluster.
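
As a rough illustration of how the per-cluster columns could be derived, the sketch below summarises the first cluster (the two points dated 1990 and 1991) in plain Python. The helper summarise_cluster is hypothetical and is not GeoJikuu's code. It assumes that the temporal midpoint is the mean timestep and that the bounding radius is the largest distance from the midpoint to any member, which reproduces the first output row here but may not match the library in every case.

```python
from math import dist
from statistics import fmean


def summarise_cluster(coords, times, values):
    """Summarise one cluster (illustrative assumptions, see the text above)."""
    midpoint = tuple(fmean(axis) for axis in zip(*coords))
    return {
        "value": fmean(values),                            # mean aggregate
        "date_converted": fmean(times),                    # temporal midpoint
        "midpoint": midpoint,                              # mean coordinate
        "count": len(coords),
        "mbr": max(dist(midpoint, c) for c in coords),     # assumed definition
        "temporal_extent": (min(times), max(times)),
    }


# The two points of the first cluster, copied from the tables above.
coords = [
    (-0.5867252094096281, 0.5760951446437298, 0.5690939403658662),
    (-0.5865446454704655, 0.5761508222375707, 0.5692236896905267),
]
summary = summarise_cluster(coords, times=[726544, 726909], values=[1, 2])
print(summary["value"], summary["date_converted"], round(summary["mbr"], 6))
```

Swapping fmean for sum in the "value" entry would give a sum aggregate instead of a mean.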

You will notice that the 'midpoint' and 'mbr' columns pertain to the projected coordinate system. The midpoint (in Cartesian form) can easily be converted back to latitude and longitude, and the minimum bounding radius can easily be converted to kilometres:

results = st_distance_based.aggregate(spatial_distance=100/unit_conversion, temporal_distance=365, aggregate_type="mean")
results["midpoint"] = cartesian_projector.inverse_project(results["midpoint"])
results["mbr"] = results["mbr"] * unit_conversion
results
Output: Aggregated 6 points into 3 clusters.
   value     date_converted  midpoint                                 count  mbr       temporal_extent
0  1.500000  726726.5        (34.69158843701244, 135.51796991635013)  2      0.730155  (726544, 726909)
1  1.000000  727275.0        (34.6525807, 135.5059984)                1      0.000000  (727275, 727275)
2  4.666667  728005.0        (35.68857144588458, 139.7577815841674)   3      4.534667  (727640, 728370)
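
For intuition, the reverse conversions can be sketched under the same unit-sphere assumption (radius 6371 km). The function to_lat_lon and the radius constant are assumptions for illustration only; in practice, rely on inverse_project() and the unit_conversion value returned by the projector.

```python
from math import atan2, degrees, hypot

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def to_lat_lon(x, y, z):
    """Convert a Cartesian point back to latitude/longitude in degrees.

    atan2 is used so the point does not need to lie exactly on the unit
    sphere (a mean of several surface points sits slightly inside it).
    """
    lat = degrees(atan2(z, hypot(x, y)))
    lon = degrees(atan2(y, x))
    return lat, lon


# Spatial part of the first cluster's midpoint from the earlier output.
lat, lon = to_lat_lon(-0.5866349274400469, 0.5761229834406503, 0.5691588150281965)

# The rounded 'mbr' of 0.000115 corresponds to roughly 0.73 km; the table
# value differs slightly, presumably because it uses the unrounded radius.
radius_km = 0.000115 * EARTH_RADIUS_KM
```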

You will also notice that the 'date_converted' column, which contains the temporal midpoint of each cluster, is still represented by timesteps. It can be converted back to date via:

results['date_converted'] = results['date_converted'].apply(date_convertor.days_to_date)
results
   value     date_converted  midpoint                                 count  mbr       temporal_extent
0  1.500000  17/09/1990      (34.69158843701244, 135.51796991635013)  2      0.730155  (726544, 726909)
1  1.000000  19/03/1992      (34.6525807, 135.5059984)                1      0.000000  (727275, 727275)
2  4.666667  19/03/1994      (35.68857144588458, 139.7577815841674)   3      4.534667  (727640, 728370)
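
The dates in the final table can be cross-checked without GeoJikuu by anchoring on one known pair from the earlier output, timestep 726544 for 19/03/1990, and counting days from there. The helper timestep_to_date and the anchor constants are hypothetical and only illustrate the arithmetic; days_to_date() remains the function to use in practice.

```python
from datetime import datetime, timedelta

# Known pair taken from the 'date' and 'date_converted' columns above.
ANCHOR_TIMESTEP = 726544
ANCHOR_DATE = datetime(1990, 3, 19)


def timestep_to_date(timestep, date_format="%d/%m/%Y"):
    """Convert a (possibly fractional) timestep to a date string."""
    offset = timedelta(days=timestep - ANCHOR_TIMESTEP)
    return (ANCHOR_DATE + offset).strftime(date_format)


midpoint_dates = [timestep_to_date(t) for t in (726726.5, 727275.0, 728005.0)]
print(midpoint_dates)  # ['17/09/1990', '19/03/1992', '19/03/1994']
```

A fractional timestep such as 726726.5 falls at midday, so formatting it with a date-only pattern simply drops the half day.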