remove malicious script tags from file

Here’s a small Windows Forms application that I created to automate the removal of malicious SCRIPT tags inserted into some web files (or script tags in general, malicious or not).

Of course, you can always do this manually, but if we’re talking about hundreds or thousands of files, it will be one heck of a job.

The idea is to:

  1. retrieve list of all script tags in all files in a given folder (including subfolders)

  2. list scripts found

  3. select the scripts to remove. If a script contains line breaks, select it and click the [View Script Detail] button to see it in full. Also note that the checkedListBox is not set to check on click, so selecting an item does not check it

  4. set a folder in which to save the “cleaned” files

  5. then process: the selected scripts are removed and the cleaned files are saved to the Target Folder, retaining their folder hierarchy

That’s it.

Here’s a glimpse at the “core” code for the application. Note that, for simplicity, I used recursion instead of the faster, better-performing stack approach.

The complete source code can be downloaded below, along with the output (executable).

Search a root folder (and its subfolders and files) for script tags (and their contents, of course):

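// Requires: using System.IO; using System.Text.RegularExpressions;

// Recursive: searches every file in the folder, then descends into each subfolder.
private void SearchFolder(string newRootFolder)
{
    DirectoryInfo rootDir = new DirectoryInfo(newRootFolder);
    foreach (FileInfo fi in rootDir.GetFiles())
    {
        SearchFile(fi);
    }

    foreach (DirectoryInfo di in rootDir.GetDirectories())
    {
        SearchFolder(di.FullName);
    }
}

// Adds every distinct <script>...</script> block found in the file to the list.
private void SearchFile(FileInfo fi)
{
    using (StreamReader sr = new StreamReader(fi.FullName))
    {
        string fileContent = sr.ReadToEnd();
        MatchCollection ms =
            Regex.Matches(
                fileContent,
                @"<script([^>]*)>.*?</script>",
                // Singleline: handle line breaks inside script tags.
                // IgnoreCase: HTML tag names are case-insensitive (<SCRIPT>).
                RegexOptions.Singleline | RegexOptions.IgnoreCase);

        foreach (Match m in ms)
        {
            // Skip scripts that are already listed.
            if (checkedListBox1.Items.Contains(m.Value))
                continue;

            checkedListBox1.Items.Add(m.Value);
        }
    }
}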
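
For comparison, here is a rough sketch of the stack approach mentioned earlier. It is not part of the downloadable source; the class name FolderWalker and the method EnumerateFiles are made up for illustration. An explicit Stack of pending folders replaces the recursive calls, so the depth of the folder tree no longer affects the call stack.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the original application.
public static class FolderWalker
{
    // Iterative depth-first walk: an explicit stack of pending folders
    // takes the place of the call stack used by the recursive version.
    public static IEnumerable<string> EnumerateFiles(string rootFolder)
    {
        var pending = new Stack<string>();
        pending.Push(rootFolder);

        while (pending.Count > 0)
        {
            string current = pending.Pop();

            // Yield the files directly inside the current folder.
            foreach (string file in Directory.GetFiles(current))
            {
                yield return file;
            }

            // Queue the subfolders so they are visited later.
            foreach (string subfolder in Directory.GetDirectories(current))
            {
                pending.Push(subfolder);
            }
        }
    }
}
```

With something like this in place, the search could be driven by a single loop, for example `foreach (string file in FolderWalker.EnumerateFiles(rootFolder)) SearchFile(new FileInfo(file));`. Folders are visited in a different order than in the recursive version, which should not matter here since every file gets searched either way.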

Process a root folder (and its subfolders and files): if a script marked for removal is found, replace it with an empty string (effectively removing it), then save the file to the Target Folder:

// Requires: using System.IO; using System.Text;

// Recursive: processes every file in the folder, then descends into each subfolder.
private void ProcessFolder(string newRootFolder)
{
    DirectoryInfo rootDir = new DirectoryInfo(newRootFolder);
    foreach (FileInfo fi in rootDir.GetFiles())
    {
        ProcessFile(fi);
    }

    foreach (DirectoryInfo di in rootDir.GetDirectories())
    {
        ProcessFolder(di.FullName);
    }
}

private void ProcessFile(FileInfo fi)
{
    string path = fi.FullName;

    // Read the whole file first so the reader is closed before anything is written.
    string fileContent;
    using (StreamReader sr = new StreamReader(path))
    {
        fileContent = sr.ReadToEnd();
    }

    StringBuilder sb = new StringBuilder(fileContent);
    int origLength = sb.Length;
    foreach (string stringToRemove in selectedScripts)
    {
        sb.Replace(stringToRemove, String.Empty);
    }

    // A change in length means at least one selected script was removed.
    if (sb.Length != origLength)
    {
        // Swap the root folder (textBox1) for the Target Folder (textBox2)
        // so the file keeps its place in the folder hierarchy.
        string newFilePath = path.Replace(textBox1.Text, textBox2.Text);
        string newFileDirectory = Path.GetDirectoryName(newFilePath);
        if (!Directory.Exists(newFileDirectory))
        {
            Directory.CreateDirectory(newFileDirectory);
        }

        string newFileContent = sb.ToString();
        using (StreamWriter sw = File.CreateText(newFilePath))
        {
            sw.Write(newFileContent);
        }
    }
}
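
One thing to keep in mind: the string Replace used to build the new file path swaps every occurrence of the root folder text, not just the leading one. Below is a minimal sketch of a stricter mapping, assuming both folders are given as full paths. The names TargetPathMapper and MapToTarget are hypothetical and do not appear in the downloadable source.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the original application.
public static class TargetPathMapper
{
    // Maps a file under sourceRoot to the same relative location under targetRoot.
    public static string MapToTarget(string sourceRoot, string targetRoot, string filePath)
    {
        // Normalize the root so it always ends with exactly one separator.
        string root = Path.GetFullPath(sourceRoot)
            .TrimEnd(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar)
            + Path.DirectorySeparatorChar;
        string full = Path.GetFullPath(filePath);

        // Case-insensitive comparison, assuming a Windows file system.
        if (!full.StartsWith(root, StringComparison.OrdinalIgnoreCase))
        {
            throw new ArgumentException("File is not inside the source root.", "filePath");
        }

        // Only the leading root is stripped; the rest of the path is kept as-is.
        string relative = full.Substring(root.Length);
        return Path.Combine(targetRoot, relative);
    }
}
```

In ProcessFile this would stand in for the Replace call, for example `string newFilePath = TargetPathMapper.MapToTarget(textBox1.Text, textBox2.Text, path);`.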

Files for Download:

ScriptRemover_Executable.zip (11.11 kb)

ScriptRemover_Source.zip (10.57 kb)

Hope this helps in one way or another and, as usual, feel free to make comments/corrections. This was put together in a hurry, but I tried my best to make it useful and working.

Note: This has some known limitations (due to the regular expression used). It does not handle:

  1. script tags that contain extra spaces, like <script>abc</script > (note that the end script tag has a space before the closing bracket)

  2. self-closing script tags like <script src="url" />

I had no need to handle these cases. Should you need to handle them, feel free to drop me a message and I’ll try to help out.
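
For anyone who does need those two cases, here is a rough sketch of a more forgiving pattern. It is only an illustration, not what the downloadable application uses, and the names ScriptTagFinder and FindScripts are made up. The first alternative matches a self-closing tag; the second matches an opening tag, its contents, and an end tag that may have whitespace before the closing bracket.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original application.
public static class ScriptTagFinder
{
    private static readonly Regex ScriptTag = new Regex(
        // Either <script ... /> or <script ...> ... </script   >
        @"<script\b[^>]*?/>|<script\b[^>]*>.*?</script\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns each distinct script tag found, in order of first appearance.
    public static List<string> FindScripts(string html)
    {
        var found = new List<string>();
        foreach (Match m in ScriptTag.Matches(html))
        {
            if (!found.Contains(m.Value))
            {
                found.Add(m.Value);
            }
        }
        return found;
    }
}
```

This is still a regular expression and not an HTML parser, so odd markup can fool it. For instance, an unquoted attribute value that ends in a slash right before the closing bracket would be read as a self-closing tag.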

By the way, Happy 2009 everyone!
