A large amount of measurement data shows that the end-to-end distance in the Internet can be approximated by delay, and that the direction is determined by the path. Based on this observation, the geolocation method trains neural networks on the combination of delay and path. Because the ISPs (Internet service providers) of the covert communication entities are unknown, and ISPs in some countries are not fully interconnected within a city, we use landmarks around the entities when geolocating them, to ensure the consistency of the training samples.
The method is divided into six parts: obtaining landmarks, vector construction of landmarks, acquisition of training sets, training neural networks, vector construction of the covert communication entity, and entity geolocation. Figure 3 shows the framework of the method.
The specific steps of the method are as follows:
- 1)
Obtaining landmarks. Given the IP address of the covert communication entity, obtain landmarks around the entity.
- 2)
Vector construction of landmarks. Deploy n probes \( P_1, P_2, \dots, P_n \) and acquire the delay and path from the probes to the landmarks. Then, encode the path with the hop-hot path coding method to obtain the vectors of delay and path
$$ {V}_k=\left({d}_{k,1},{d}_{k,2},\dots, {d}_{k,n},{C}_k\right). $$
(1)
where \( V_k \) represents the vector of delay and path of the kth landmark, \( d_{k,i} \) represents the delay from probe i to landmark k, and \( C_k \) represents the encoded path vector of landmark k.
- 3)
Acquisition of training sets. Use (1) to cluster the landmarks, and then use the latitude and longitude of the landmarks to cluster them again. Take the intersection of the two clustering results to obtain the training sets, denoted as
$$ F=\left\{{S}_1,{S}_2,\dots, {S}_q\right\}. $$
(2)
where \( S_i \) is the ith training set.
- 4)
Neural network training. Train a neural network for each training set, taking the vectors (1) of the landmarks in \( S_i \) as input and their latitude and longitude as output.
- 5)
Vector construction of the covert communication entity. Acquire the delay and path from the probes to the entity, and encode the path to obtain the vector of delay and path of the covert communication entity
$$ {V}_T=\left({d}_1,{d}_2,\dots, {d}_n,{C}_T\right). $$
(3)
where \( V_T \) represents the vector of delay and path of the covert communication entity, \( d_i \) represents the delay from probe i to the entity, and \( C_T \) represents the encoded path vector of the entity.
- 6)
Geolocation of the covert communication entity. Calculate the similarity \( {sim}_i \) between (3) and \( S_i \). Set a threshold U and let \( M=\underset{i=1,\dots, q}{\max}\left({sim}_i\right) \). If M ≥ U, input (3) into the neural network trained on the \( S_i \) that attains M to obtain the latitude and longitude of the entity; otherwise, end the method.
Among these, the hop-hot path coding method, the acquisition of training sets, and the geolocation of the covert communication entity are the key parts of the method and are described in detail in the following subsections.
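Equations (1) and (3) amount to concatenating the n probe delays with the encoded path vector. The following minimal sketch illustrates this; the function name `build_vector` and the sample values are hypothetical, and delay units or feature scaling, which the method does not specify, are left out.

```python
from typing import Sequence, Tuple


def build_vector(delays: Sequence[float],
                 path_code: Sequence[float]) -> Tuple[float, ...]:
    """Return (d_1, ..., d_n, C) as one flat feature vector, as in Eq. (1) or (3)."""
    return tuple(delays) + tuple(path_code)


# Hypothetical example: n = 3 probes and an encoded path vector of length 4.
delays_k = [12.5, 30.1, 8.7]   # delay from each probe to landmark k
path_code_k = [2, 9, 1, 9]     # encoded path vector C_k
v_k = build_vector(delays_k, path_code_k)
```

The same function would apply to the entity vector of Eq. (3), since both vectors share one layout.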
Hop-hot path coding method
The path from a probe to the entity IP is a sequence of routers, such as <probe, router1, router2, …, routern, entity IP>. One-hot coding can measure the similarity between paths by checking whether each router appears in a path. However, paths are ordered sequences, and one-hot coding cannot express this order, so it is not well suited to representing paths. To better measure the similarity between paths, this paper proposes a path coding method, hop-hot path coding. The coded path vector can be fed directly into a machine learning model as a feature or used to compare similarity.
The coding process is as follows. First, stable router paths from the probes to all landmarks are obtained, together with the sets of all routers. Then, one-hot coding is used to encode each stable router path to obtain the path vector. After that, the path vector is quantized by hop count. Finally, the entity's router path vector is quantized in the same way. The details are as follows:
Step 1. Building the router path set. n probes are used to measure m landmarks, and a stable router path set of size n × m is obtained. The set is recorded as
$$ \mathbf{E}=\left\{\begin{array}{l}{p}_{1,1},{p}_{1,2},\dots, {p}_{1,n}\\ {}{p}_{2,1},{p}_{2,2},\dots, {p}_{2,n}\\ {}\dots \\ {}{p}_{m,1},{p}_{m,2},\dots, {p}_{m,n}\end{array}\right\}. $$
(4)
where \( p_{k,i} \) is the measured router path from the ith probe to the kth landmark.
Step 2. Extracting routers. All routers in the router paths from the ith probe to the m landmarks are extracted. The extraction result is
$$ {\mathbf{O}}_i=\left\{{r}_{i,1},{r}_{i,2},\dots, {r}_{i,{l}_i}\right\}. $$
(5)
where \( r_{i,j} \) is the jth router in the measured paths from the ith probe to the m landmarks (the order of the routers is irrelevant), and \( l_i \) is the number of routers appearing in the measured paths whose source is the ith probe. The feature space of the path coding is formed from all \( \mathbf{O}_i \) and is recorded as
$$ \mathbf{L}=\left\{{\mathbf{O}}_{\mathbf{1}},{\mathbf{O}}_{\mathbf{2}},\dots, {\mathbf{O}}_{\mathbf{n}}\right\}. $$
(6)
That is equivalent to
$$ \mathbf{L}=\left\{\left\{{r}_{1,1},{r}_{1,2},\dots, {r}_{1,{l}_1}\right\},\left\{{r}_{2,1},{r}_{2,2},\dots, {r}_{2,{l}_2}\right\},\dots, \left\{{r}_{n,1},{r}_{n,2},\dots, {r}_{n,{l}_n}\right\}\right\}. $$
(7)
where n is the number of probes.
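Steps 1 and 2 can be sketched as follows. This is an illustration only: the dictionary layout `paths[(k, i)]`, the function name `build_feature_space`, and the router labels are assumptions made for the example, not part of the method.

```python
from typing import Dict, List, Tuple

# paths[(k, i)] holds the router sequence p_{k,i} from probe i to landmark k.
Path = List[str]


def build_feature_space(paths: Dict[Tuple[int, int], Path],
                        n_probes: int) -> List[List[str]]:
    """Return L = [O_1, ..., O_n]; O_i lists every router seen from probe i."""
    feature_space: List[List[str]] = []
    for i in range(1, n_probes + 1):
        seen: List[str] = []
        for (_k, probe), path in sorted(paths.items()):
            if probe != i:
                continue
            for router in path:
                if router not in seen:  # first-seen order; the order is irrelevant
                    seen.append(router)
        feature_space.append(seen)
    return feature_space


# Hypothetical measurements: m = 2 landmarks, n = 2 probes.
paths = {
    (1, 1): ["a", "b", "c"],
    (2, 1): ["a", "d"],
    (1, 2): ["x", "c"],
    (2, 2): ["x", "y", "d"],
}
L = build_feature_space(paths, n_probes=2)
```

Router "c" appears in both O_1 and O_2 here because the feature space keeps one router set per probe, as in Eq. (7).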
Step 3. Building landmarks’ router path vector. For landmark k, according to the router paths from each probe to landmark k, the landmark is coded in feature space L. The coding result is recorded as
$$ {\mathbf{C}}_k=\left({V}_{1,1,k},\dots, {V}_{i,j,k},\dots, {V}_{n,{l}_n,k}\right). $$
(8)
The value of \( V_{i,j,k} \) is defined as
$$ {V}_{i,j,k}=\left\{\begin{array}{ll}\beta, & \text{if } {r}_{i,j} \text{ is not in } {p}_{k,i}\\ {H}_{i,j,k}, & \text{if } {r}_{i,j} \text{ is in } {p}_{k,i}\end{array}\right.. $$
(9)
where \( H_{i,j,k} \) is the number of hops from router \( r_{i,j} \) to landmark k, and β is a control parameter whose value is greater than the length of every path \( p_{x,y} \) (1 ≤ x ≤ m, 1 ≤ y ≤ n).
Step 4. Building the router path vector of the covert communication entity. As for the landmarks, the coding result of the entity in the feature space L is recorded as
$$ {\mathbf{C}}_T=\left({V}_{1,1,T},\dots, {V}_{i,j,T},\dots, {V}_{n,{l}_n,T}\right). $$
(10)
The value of \( V_{i,j,T} \) is defined as
$$ {V}_{i,j,T}=\left\{\begin{array}{ll}\beta, & \text{if } {r}_{i,j} \text{ is not in } {p}_{T,i}\\ {H}_{i,j,T}, & \text{if } {r}_{i,j} \text{ is in } {p}_{T,i}\end{array}\right.. $$
(11)
where \( p_{T,i} \) is the router path from the ith probe to the entity T, and \( H_{i,j,T} \) is the number of hops from router \( r_{i,j} \) to the entity T. A router that appears in the router path from a probe to the entity but not in any router path from that probe to the landmarks is ignored.
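Steps 3 and 4 can be illustrated with a short sketch. The function name `hop_hot_encode`, the router labels, and the value of β are hypothetical. The sketch also assumes that the target sits one hop after the last router of a path, so a router at position idx in a path of length len(path) is len(path) - idx hops away; the method itself does not spell out how hops are counted.

```python
from typing import List, Sequence


def hop_hot_encode(paths_by_probe: Sequence[Sequence[str]],
                   feature_space: Sequence[Sequence[str]],
                   beta: int) -> List[int]:
    """Encode one target (landmark or entity) in the feature space L.

    paths_by_probe[i] is the router path from probe i to the target and
    feature_space[i] is the router set O_i of the same probe.
    """
    code: List[int] = []
    for path, routers in zip(paths_by_probe, feature_space):
        # Assumed hop count: the last router is 1 hop from the target.
        hops = {r: len(path) - idx for idx, r in enumerate(path)}
        for r in routers:
            code.append(hops.get(r, beta))  # beta marks "router not on this path"
    return code


# Hypothetical feature space for n = 2 probes; beta exceeds every path length.
feature_space = [["a", "b", "c", "d"], ["x", "c", "y", "d"]]
beta = 10

c_landmark = hop_hot_encode([["a", "b", "c"], ["x", "c"]], feature_space, beta)
# Router "z" is on the entity path only, so it is ignored.
c_entity = hop_hot_encode([["a", "z", "d"], ["x", "y", "d"]], feature_space, beta)
```

Because absent routers receive β, which is larger than any real hop count, two paths that share routers close to the target yield vectors that are near each other, which one-hot coding cannot express.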
Acquisition of training sets
In the actual Internet environment, some countries and regions have multiple ISPs. Even if landmarks are geographically close, their vectors of delay and path may differ greatly. If all landmarks are used as a single training set, the mapping learned by the neural network will be weak, and the geolocation reliability is hard to guarantee. Therefore, the landmarks need to be filtered into training sets so that the delays, paths, latitude, and longitude of the landmarks in each training set are similar. The specific steps are as follows:
Input: Vectors of delay and path of landmarks, longitude, and latitude of landmarks
Output: Filtered training sets
Step 1. Use (1) to perform K-means clustering on the landmarks. The number of clusters k is iterated from small to large, the silhouette coefficient of each clustering is calculated, and the k with the maximum silhouette coefficient is selected. The resulting clustering is recorded as \( K=\{D_1, D_2, \dots, D_k\} \).
Step 2. Use the latitude and longitude of the landmarks to cluster all the landmarks. The number of clusters is again the value with the maximum silhouette coefficient, recorded as h, and the clustering is recorded as \( Q=\{L_1, L_2, \dots, L_h\} \).
Step 3. Calculate F = K ∩ Q and record the final set of categories as \( F=\{S_1, S_2, \dots, S_q\} \).
At this point, the delay, path, latitude, and longitude of the landmarks in each training set are similar. The neural network is trained by using the landmarks in each training set, and the mapping between delay, path, and location will be more reliable.
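The selection and intersection steps can be sketched as follows. K-means itself is not implemented here: the candidate label lists are assumed to come from any K-means implementation run for increasing k. The sketch also interprets K ∩ Q as the non-empty pairwise intersections of clusters, which is an assumption about the intended meaning, and uses Euclidean distance for the silhouette coefficient. All function names are hypothetical.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Sequence, Tuple


def silhouette(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient of a clustering (Euclidean distance)."""
    clusters: Dict[int, List[int]] = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for idx, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != idx]
        if not own:  # singleton cluster contributes 0
            continue
        a = sum(dist(points[idx], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[idx], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


def best_labels(points: Sequence[Sequence[float]],
                candidates: Sequence[List[int]]) -> List[int]:
    """Pick the candidate clustering with the largest silhouette coefficient."""
    return max(candidates, key=lambda labels: silhouette(points, labels))


def intersect_clusterings(labels_k: Sequence[int],
                          labels_q: Sequence[int]) -> List[List[int]]:
    """Group landmark indices that share a cluster in both clusterings."""
    groups: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for idx, pair in enumerate(zip(labels_k, labels_q)):
        groups[pair].append(idx)
    return [groups[key] for key in sorted(groups)]


# Hypothetical data: four points and two candidate clusterings.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
chosen = best_labels(points, [[0, 1, 0, 1], [0, 0, 1, 1]])

# Six landmarks clustered by delay/path vector (K) and by location (Q).
training_sets = intersect_clusterings([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

Each resulting group contains only landmarks that agree in both clusterings, so landmark 2 in the example ends up in a training set of its own.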
Geolocation of covert communication entity
After a neural network has been trained for each training set, geolocating the covert communication entity first requires determining the training set to which the entity belongs. Then, the entity's vector of delay and path is input into the neural network trained on that training set to obtain the latitude and longitude of the entity. The specific steps are as follows:
Input: The vector of delay and path of the entity
Output: Longitude and latitude of the entity
Step 1. Calculate the cosine similarity between (3) and the center of each \( D_i \), and take the \( D_i \) whose center has the highest cosine similarity as the cluster to which the entity T belongs.
Step 2. Calculate the cosine similarity between the entity and each landmark in \( D_i \). Find the landmark whose vector of delay and path is most similar to the entity's vector. Record the training set to which this landmark belongs as \( S_j \), and use \( S_j \) as the training set of the entity. The vector similarity between this landmark and the entity is recorded as M.
Step 3. Set the threshold U. If M ≥ U, use the neural network trained on the training set \( S_j \) to geolocate the entity; otherwise, end the method.
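The three steps can be sketched as follows. The dictionary layouts, the function names, and the sample vectors are hypothetical, and the trained neural networks are replaced by stand-in callables that return fixed coordinates, since the network architecture is not described here.

```python
from math import sqrt
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Model = Callable[[Vector], Tuple[float, float]]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def geolocate(entity: Vector,
              centers: Dict[int, Vector],
              cluster_members: Dict[int, List[int]],
              landmark_vectors: Dict[int, Vector],
              training_set_of: Dict[int, int],
              models: Dict[int, Model],
              threshold: float) -> Optional[Tuple[float, float]]:
    # Step 1: cluster D_i whose center is most similar to the entity vector.
    best_d = max(centers, key=lambda d: cosine(centers[d], entity))
    # Step 2: most similar landmark inside that cluster; its similarity is M.
    best_lm = max(cluster_members[best_d],
                  key=lambda lm: cosine(landmark_vectors[lm], entity))
    m = cosine(landmark_vectors[best_lm], entity)
    # Step 3: use the network of S_j only if M reaches the threshold U.
    if m < threshold:
        return None
    return models[training_set_of[best_lm]](entity)


# Hypothetical data: two clusters, three landmarks, three training sets.
centers = {0: (1.0, 0.0), 1: (0.0, 1.0)}
cluster_members = {0: [0, 1], 1: [2]}
landmark_vectors = {0: (1.0, 0.1), 1: (1.0, 0.5), 2: (0.1, 1.0)}
training_set_of = {0: 0, 1: 1, 2: 2}
models = {
    0: lambda v: (30.0, 114.0),  # stand-ins for trained neural networks
    1: lambda v: (31.0, 115.0),
    2: lambda v: (32.0, 116.0),
}
entity = (1.0, 0.12)

located = geolocate(entity, centers, cluster_members, landmark_vectors,
                    training_set_of, models, threshold=0.99)
rejected = geolocate(entity, centers, cluster_members, landmark_vectors,
                     training_set_of, models, threshold=0.99999)
```

Returning `None` when M < U mirrors the "end the method" branch: the entity is left unlocated rather than mapped by a network trained on dissimilar landmarks.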