Ransomware is a class of malware that has gained attention over the past few
years and is seen as a threat to the personal information of users, from
individuals to enterprises. It deprives the user of access to files on the
infected system and demands a ransom, typically in bitcoins, in return for
restoring access. Its damage costs are predicted to hit $11.5B by 2019. In an
attempt to protect users' vital data from such attacks, in this work we
deployed more robust, efficient, accurate and newer technologies that detect
malicious activity on a system using several indicators, including the
analysis of users' data on data processing platforms such as Hadoop and R,
together with machine learning techniques. These were tested with the aim of
alerting the user before a significant amount of information is lost, i.e.,
the approach limits data loss and also reduces the number of erroneous
results by providing the user with details that can be used to flag a file as
either safe or unsafe.
The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee
of the International Conference on Computational Intelligence and Data Science
Ransomware, as stated above, is a kind of malware that prevents the user from
accessing his/her files and demands a ransom in exchange for decrypting
them. These malicious programs mostly spread by tricking users into clicking
on pop-ups that appear to be safe. Once such a spurious pop-up is clicked, a
ransomware program is installed on the system and looks for files bearing
extensions such as JPG, XLS, PNG, PPT, DOC, etc. These files are generally
important ones in any computer system. The installed program forces the user
to pay a sum, which varies from attack to attack, to the perpetrators,
generally in the form of cryptocurrency. The team responsible for spreading
the ransomware takes care to keep its identity secret, and to do so it makes
sure that no one can trace the payment it receives. Attackers generally use
the Tor network to hide their location. Ransomware also spreads via
traditional email: more than 60 percent of ransomware spreads via email
(specifically as a Microsoft Word document or a .ZIP file). According to
Cisco Systems' 2017 Annual Cybersecurity Report, 65 percent of email traffic
is spam, and about 10 percent of the global spam observed in 2016 was
classified as malicious.
Damages due to ransomware:
Businesses as well as individuals need to be fully aware of the threat
posed by ransomware and make cybersecurity a top priority. According to
Kaspersky, in any interval of 2 minutes at least 3 companies are hit by one
type of ransomware or another. Moreover, attacks on businesses increased
three-fold in 2016. Ransomware attacks can disrupt important systems and
destroy confidential data. According to some reports from Microsoft,
ransomware accounted for $325 million in damage. Cybersecurity Ventures
predicted the cost of damage to be $1 billion in 2016, and Cisco's 2017
Annual Cybersecurity Report puts the annual growth of ransomware at 3.5
times.
Other than the financial impact, there is permanent or temporary loss of
sensitive or proprietary data, and regular operations are disrupted. On an
organizational level, an attack potentially harms the organization's
reputation. Even after paying the ransom, there is no guarantee that the
encrypted files will be decrypted. Nor can it be said that the malware
infection has been completely eradicated from the computer system.
Some information relevant to the work:
Ransomware variants can be loosely classified into the following three
categories:
1. Ransomware attacks of this kind can be called denial-of-service attacks,
since the legitimate user is prevented from working on his files or
performing any other activity until a particular code is texted to an SMS
provider that charges the user premium rates. Sometimes the attack appears to
come from a legal authority or from the operators of the user's OS. The
victim may also be asked to pay via online payment systems. Attacks of this
kind do not generally damage the files inside the system. Below is the image
of one such ransomware that we developed.
2. Another type of ransomware may or may not lock access to the system but
encrypts all personal, vital and important data. Since the malware relies on
complicated encryption algorithms, it is difficult to decrypt the data and
regain access without paying the attacker a hefty ransom to obtain the
decryption key. Such ransomware might also delete files.
3. This type of ransomware is believed to be the most dangerous because, in
addition to the above two kinds of damage, it also infects the booting
mechanism of the operating system. On switching on the system, the victim
must then follow the instructions that the ransom note provides.
When these types of malware enter a device, it is often difficult to detect
them and respond in time, since a good number of upgraded and differentiated
variants come into existence every day, each of which exhibits different
behavior. This makes it difficult to design a tool that can resist something
that changes its characteristics rapidly and behaves differently every time.
Moreover, it is difficult to differentiate them from benign software that
sometimes behaves the way a ransomware infection would. In our work, the
focus is on detecting the files causing the first and second types of
ransomware attack.
Therefore, in this work contributions have been made towards:
1. Identifying four indicators: All these indicators were identified on the
basis of how ransomware behaves on a system containing files. Each indicator
was designed to analyze a particular behavior: finding destructive content in
target files/source code or analyzing the type of files, while the other
indicators keep a check on data integrity, uncommon read/write behavior and
file deletions. Each of these indicators is explained in the next section.
2. Protecting from unseen malware attacks: Because more dynamic machine
learning techniques and their classification and prediction models are used,
it is now easier to promptly detect malware that the system has not seen
before.
3. Minimizing the amount of data loss: When all these indicators work
together, they are able to alert the user at an early stage to unwanted
activities and to what is causing them.
4. Safely differentiating between benign and harmful files: After the files
are checked for harmful content or destructive actions on the user's file
system, which trigger the indicators accordingly, the files can be further
classified into a 'safe' or 'unsafe' category by a classification algorithm
(hypothesis testing), with control given to the user to review the contents
before each file is classified.
III. Detection Mechanisms and
1) Analyzing files for malicious contents:
The Java-based programming framework Hadoop has been used to analyse the
contents of files in a documents directory consisting of 150 files. The
directory consisted of 70% XML files, 10% XSL files and 20% source code files
generated by various application programs on the computer system.
Hadoop was chosen as the platform for these operations for a number of
reasons:
a) It pre-processes large data sets by removing unrelated words and excluding
the less frequent ones, which results in faster data processing compared with
traditional ways of accessing files and searching for patterns.
b) Since traditional databases and warehouses read data in 8k or 16k block
sizes, they become inefficient when processing large data sets. Hadoop, on
the other hand, has proved to work best with semi-structured and unstructured
data sets.
c) Most of Hadoop's algorithms use a two-stage paradigm, one of which
(MapReduce) is used in our work, which makes processing easy when the data
set is too large.
In this approach the MapReduce algorithm was deployed on the set of input
files described above. A rigorous search was made for a string or particular
words believed to be malicious, which successfully detected the location of
these words by giving the path of the file in which they were most
frequently used. It shows various forms of occurrence of the same word, as
shown below.
After running this algorithm several times on all the files for different
words, a value is generated for each word that helps us identify the level
above which this indicator should be triggered, stating that malicious files
might be present in the directory and giving the location of such files.
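The keyword search and per-word threshold described above can be illustrated
with a minimal single-machine sketch in Python. It only mimics the map and
reduce stages and is not the Hadoop job used in this work; the word list
SUSPICIOUS_WORDS, the sample file contents and the threshold value are
assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical watch list; the paper does not enumerate its search terms.
SUSPICIOUS_WORDS = {"encrypt", "ransom", "bitcoin"}

def map_phase(path, text):
    """Emit ((word, path), 1) for every token that contains a watched word."""
    for token in re.findall(r"[a-z]+", text.lower()):
        for word in SUSPICIOUS_WORDS:
            if word in token:  # also matches variants such as "encrypted"
                yield (word, path), 1

def reduce_phase(pairs):
    """Sum the counts emitted for each (word, path) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def flag_files(files, threshold):
    """Return paths where some watched word occurs at least threshold times."""
    pairs = [kv for path, text in files.items() for kv in map_phase(path, text)]
    totals = reduce_phase(pairs)
    return sorted({path for (_, path), n in totals.items() if n >= threshold})

# Made-up file contents standing in for the documents directory.
sample_files = {
    "docs/report.xml": "<note>quarterly sales figures</note>",
    "docs/job.xsl": "Encrypt all files; files encrypted; pay ransom in bitcoin",
}
flagged = flag_files(sample_files, threshold=2)
print(flagged)
```

Matching on substrings is one simple way to catch different forms of the same
word; a real deployment would tune the threshold per word from repeated runs,
as described above.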
Fig 1: A category 1 type ransomware that shows false notices.
2) Data Integrity:
Data integrity is a fundamental component of information security. Like
conventional file systems, Hadoop HDFS also offers a file system consistency
and integrity check. The fsck command was used to find out whether the HDFS
system has any corrupt blocks. Data integrity is breached when any unintended
changes are made to files containing critical information. Since all the
ransomware described in category 2 above mounts data integrity attacks, in
that all it does is change the data using complex encryption algorithms, this
indicator plays a major role in detection.
To keep a check on this kind of activity, the MD5 algorithm was implemented
in C# for all the files in the directory, with a 'salt' value that is known
only to the sender and the receiver. The output of this algorithm is a set of
hash codes/values.
Hash values help in checking data integrity because of the following
properties:
• The length of the hash value does not depend on the size of the file. The
MD5 algorithm produces a hash value of 128 bits.
• Even if two files differ by only a single bit, they translate into
completely different hash codes, so it is extremely unlikely that two
different files produce identical hash values by accident.
• The same hash value is generated every time the algorithm is run on the
same file.
• Given the hash of a message combined with the 'salt' value, it is
computationally infeasible to recover the original contents of the message.
Therefore, carefully examining the output after every action performed on
these files by unknown programs assures us that the data is intact as long
as the same hash values are produced. Hence, a large number of different
hash values being produced at a fast rate is an indicator of malicious
programs attempting to encrypt the data.
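As a rough illustration of this indicator, the following Python sketch
computes salted MD5 hashes for a set of files and triggers when more files
change between two snapshots than an allowed maximum. It stands in for the C#
implementation mentioned above and is not that code; the salt, the in-memory
file contents and the max_changes value are assumptions.

```python
import hashlib

def salted_md5(data: bytes, salt: bytes) -> str:
    """Return the 128-bit MD5 digest of salt + data as 32 hex characters."""
    return hashlib.md5(salt + data).hexdigest()

def snapshot(files: dict, salt: bytes) -> dict:
    """Map each path to the salted hash of its current contents."""
    return {path: salted_md5(data, salt) for path, data in files.items()}

def changed_paths(before: dict, after: dict) -> list:
    """List paths whose hash differs between two snapshots."""
    return sorted(p for p in before if p in after and before[p] != after[p])

def integrity_indicator(before: dict, after: dict, max_changes: int) -> bool:
    """Trigger when more files changed between snapshots than max_changes."""
    return len(changed_paths(before, after)) > max_changes

SALT = b"demo-salt"  # assumption: stands in for the shared secret salt
# Made-up contents; a real monitor would read the files from disk.
original = {"a.doc": b"hello", "b.xls": b"figures", "c.png": b"pixels"}
tampered = {"a.doc": b"\x8f\x02", "b.xls": b"\x11\xa0", "c.png": b"pixels"}

base = snapshot(original, SALT)
later = snapshot(tampered, SALT)
triggered = integrity_indicator(base, later, max_changes=1)
print(changed_paths(base, later), triggered)
```

Taking snapshots at fixed intervals turns the count of changed hashes into a
rate, which is the quantity the indicator above relies on.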
A safer solution in Hadoop is to maintain duplicate copies on the Hadoop
Distributed File System, i.e., data redundancy. But these blocks can get
corrupted too. After getting all the information related to those files and
blocks, we run commands in a loop to locate the server for each file until
all the files in the corrupted list are located. Checking the data node logs
then traces out accurately where the problem occurred.
Fig 2: An example of MD5 algorithm deployment that produces a unique hash
code.
3) A machine learning approach:
In this approach, an attempt is made to use a classification algorithm
(decision tree) to analyze and differentiate between a given set of malicious
and benign files. The dataset analyzed is from the UCI repository: the Detect
Malicious Executable (Antivirus) Data Set. Here, the training file is created
with 100+ non-malicious examples and 250+ malicious samples. A sign
convention of +1 for non-malicious samples and -1 for malicious samples is
used. Based on a rigorous comparison and analysis (as mentioned in figure 1),
the 500 most commonly occurring features are extracted. The testing file, on
the other hand, consists of an unknown malicious executable, on which a
similar procedure is carried out (refer figure 2). Using decision tree
techniques on this dataset, we estimate the probability of a file being
malicious or not.
We believe this approach is the most rigorous and robust of all the
techniques used, since it not only helps in classifying the existing files
through attentive analysis of system behavior but, because of its ability to
learn without being explicitly programmed, can also predict whether a new set
of files is malicious or benign when the algorithm is exposed to an unknown
set of input characteristics. This technique thus helped in reducing malware
threats to a significant extent.
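The feature-selection step mentioned above (keeping the most commonly
occurring features before training a classifier) might look like the
following Python sketch. The sparse "label index:value" line format, the
sample lines and k = 3 are assumptions chosen for illustration; the paper
keeps 500 features and does not describe its exact file layout.

```python
from collections import Counter

def parse_line(line):
    """Parse '<label> idx:val idx:val ...' into (label, set of feature ids)."""
    parts = line.split()
    label = int(parts[0])  # +1 non-malicious, -1 malicious
    features = {int(item.split(":")[0]) for item in parts[1:]}
    return label, features

def top_k_features(samples, k):
    """Return the k feature ids present in the most samples (ties by id)."""
    counts = Counter(f for _, feats in samples for f in feats)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f for f, _ in ranked[:k]]

def encode(features, selected):
    """Encode one sample as a 0/1 vector over the selected feature ids."""
    return [1 if f in features else 0 for f in selected]

# Made-up training lines in the assumed sparse format.
lines = ["+1 3:1 7:1 9:1", "-1 3:1 5:1", "-1 3:1 5:1 7:1"]
samples = [parse_line(line) for line in lines]
selected = top_k_features(samples, k=3)
vectors = [encode(feats, selected) for _, feats in samples]
print(selected, vectors)
```

The resulting fixed-length vectors, paired with their +1/-1 labels, are the
kind of input a decision tree learner would then be trained on.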
4) File activity monitoring:
While detecting ransomware, one would want to know which files are being
encrypted and to raise an alert whenever privileged users access sensitive
files. For this reason it is important to monitor the event log or the file
system log of the operating system. In this case the System Log Viewer, or
Syslog, was used on Ubuntu 16.04 to view file activity. Linux logs a large
number of events to disk, where they are mostly stored in the /var/log
directory in plain text. Most log entries go through the system logging
daemon, syslog, and are written to the system log. These system log entries
usually consist of a timestamp, user name, file name, operation (create,
read, modify, rename, delete, etc.), and a result (success or failure). This
information was therefore analyzed by taking samples from the system log to
determine a threshold value for acceptable operations on a file. Any activity
occurring more often than the expected or threshold value is flagged, and its
details are presented to the user to review whether it is harmless or
malicious.
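A minimal Python sketch of such threshold-based flagging is given below. The
whitespace-separated record layout (epoch seconds, user, path, operation,
result), the window length and the threshold are assumptions for
illustration; real syslog entries would need their own parser.

```python
from collections import Counter

def parse_entry(line):
    """Split 'epoch_seconds user path operation result' into a dict."""
    ts, user, path, operation, result = line.split()
    return {"ts": int(ts), "user": user, "path": path,
            "operation": operation, "result": result}

def flag_bursts(lines, window_seconds, threshold):
    """Return (window_start, user, operation, count) tuples above threshold."""
    counts = Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry["result"] != "success":
            continue  # only successful operations change files
        window_start = entry["ts"] - entry["ts"] % window_seconds
        counts[(window_start, entry["user"], entry["operation"])] += 1
    return sorted((w, u, op, n) for (w, u, op), n in counts.items()
                  if n > threshold)

# Made-up log sample in the assumed layout.
log = [
    "1000 alice /home/alice/a.doc read success",
    "1001 svc /home/alice/a.doc modify success",
    "1002 svc /home/alice/b.xls modify success",
    "1003 svc /home/alice/c.png modify success",
    "1004 svc /home/alice/d.ppt delete failure",
    "1070 alice /home/alice/a.doc modify success",
]
alerts = flag_bursts(log, window_seconds=60, threshold=2)
print(alerts)
```

Each alert carries enough detail (time window, user, operation, count) for
the user to review the activity before it is labelled harmless or malicious.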
Once all of the above indicators have been tested individually, the idea is
to put the whole directory through all 4 indicators and record their
observations, after which a sample data set containing all 4 indicators and
their respective outputs was recorded.
A convention of +1 if the indicator is triggered and -1 if it is not was
used, and the data was passed through a popular machine learning technique
called decision trees. The basic version of the ID3 classification algorithm
was deployed, which generates a decision tree from a fixed set of training
instances.
In the initial illustrations, we took the label as Yes if more than 2
indicators were triggered and No in the case of 2 or fewer indicators. The
resulting tree is used to classify future samples. Each illustration has
several attributes and belongs to one of the two classes. A leaf node bears
the name of a particular class, whereas a non-leaf node is a decision node
that tests an attribute.
This algorithm could easily help us decide whether a received file is
harmful or harmless.
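The combination step can be sketched with a compact ID3 implementation in
Python, trained on every +1/-1 combination of four indicators and labelled
with the "more than 2 triggered" rule stated above. The indicator names are
hypothetical, and this is a minimal sketch rather than the implementation
used in this work.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical names for the four indicators.
INDICATORS = ["content", "integrity", "ml", "activity"]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on one attribute."""
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    """Build a tree: leaves are class names, inner nodes test one attribute."""
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs:
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    node = {"attr": best, "branches": {}, "default": majority}
    rest = [a for a in attrs if a != best]
    for value in sorted(set(r[best] for r in rows)):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["branches"][value] = id3([rows[i] for i in idx],
                                      [labels[i] for i in idx], rest)
    return node

def classify(tree, row):
    """Follow decision nodes until a leaf (class name) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attr"]], tree["default"])
    return tree

# Training set: all 16 combinations, +1 = triggered, -1 = not triggered.
rows = [dict(zip(INDICATORS, combo)) for combo in product([1, -1], repeat=4)]
labels = ["Yes" if sum(v == 1 for v in r.values()) > 2 else "No" for r in rows]
tree = id3(rows, labels, INDICATORS)

sample = {"content": 1, "integrity": 1, "ml": -1, "activity": 1}
print(classify(tree, sample))
```

Because the training set here enumerates every indicator combination, the
tree simply reproduces the labelling rule; with recorded observations instead,
the same code would learn the rule from data.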
Scope of Improvement
It is important to realize here that safeguarding and securing information
from any type of malware, particularly ransomware, means putting in
continuous effort and updating the mechanisms whenever a vulnerability is
found in the existing techniques. There is always a possibility that these
indicators are evaded, which would result in most of the malicious
activities being marked safe, thereby letting them slip through.
On carefully analyzing our work, we expect the following to be incorporated
in future versions:
• Mechanisms that protect data privacy before the threat even enters the
system, i.e., analyzing network data and deploying robust search tools such
as Elasticsearch over the network.
• The ability to work on more unstructured data, as most forms of malware
that find their way into a computer system come with different forms of text
and media.
• Improvement of the dynamic aspect of this mechanism, which would access,
detect and delete the harmful content.
• And, of course, faster operation with more accurate results, which means
fewer false positives.
Victims of ransomware attacks are often left with no option other than to
pay the ransom that is demanded. In this paper, we made an attempt to limit
such attacks by developing a set of indicators related to system behavior
when it is attacked by such malicious programs. We observed that
ransomware-prone programs triggered the indicators, and even the cases that
did not trigger some of them were later classified as malicious since we
passed them through a classification algorithm. Hence, the union
identification makes sure that no unsafe file escapes through the carefully
designed traps; on the other hand, it also reduces the number of false
positives, as it gives users the chance to review the content when the
classification is a narrow one.
Since analytics was a major aspect of this work, we conclude that the
approach is faster, more efficient and more robust because it can process
larger amounts of data, can process unstructured data and is faster than
conventional searching techniques, which helped in creating an early-warning
system for the user and reduced the amount of data loss.