doc:appunti:linux:sa:sanitizer

Differences

This shows you the differences between two versions of the page.

doc:appunti:linux:sa:sanitizer [2023/01/19 10:47] niccolo
doc:appunti:linux:sa:sanitizer [2023/01/19 12:11] (current) – [Perl Unescaped left brace warning] niccolo
Line 7: Line 7:
 I use it as a personal mail filter on GNU/Linux mail servers, because it can be activated on a per-user basis by the **Local Delivery Agent** called by **Postfix**. The LDA can be as simple as **procmail** or as complex as the **Dovecot LDA with the Pigeonhole Sieve Interpreter**.
  
-===== Perl Syntax Warning =====
+===== Perl unescaped left brace warning =====
  
-The version included in Debian Bullseye contains a bug into the Perl code, which triggers the warning message:
+The Sanitizer version included in Debian Bullseye uses deprecated syntax in its Perl code, which triggers the warning message:
  
 <code>
 +Unescaped left brace in regex is passed through in regex;
 +</code>
  
 +The deprecated syntax is in the file **/usr/share/perl5/Anomy/Sanitizer/MacroScanner.pm**, at lines 120 and 127. Here is the fix:
 +
 +<code perl>
 +$score +=  4 while ($buff =~ s/\000(ID="\{[-0-9A-F]+)$/x$1/i);
 </code>
 +
 +<code perl>
 +$score +=  1 while ($buff =~ s/\000(ID="\{[-0-9A-F]+\}"|ThisWorkbook\000|PrivateProfileString)/x$1/i);
 +</code>
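
The effect of the escaped braces can be checked in isolation. The following is a minimal sketch that applies the second substitution shown above to a made-up buffer: the contents of ''$buff'' are purely hypothetical and do not come from MacroScanner.pm. Written as ''\{'' and ''\}'', the braces are matched as literal characters, so Perl has no ambiguous brace to warn about.

<code perl>
use strict;
use warnings;

# Hypothetical sample buffer: a NUL byte followed by an ID="{...}" marker.
my $buff  = "\000ID=\"{0123-ABCD}\" trailing data";
my $score = 0;

# Same substitution as in the fix above: the literal braces are escaped,
# so they are matched as plain characters. Each pass replaces one NUL byte
# with "x" and adds 1 to the score.
$score += 1 while ($buff =~ s/\000(ID="\{[-0-9A-F]+\}"|ThisWorkbook\000|PrivateProfileString)/x$1/i);

printf "score=%d buff=%s\n", $score, $buff;
</code>

With this sample buffer the loop body runs once, because only one NUL-prefixed marker is present.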
 +
  
 ===== The HTML MIME multipart problem =====
Line 19: Line 30:
 Several mail user agents nowadays compose email messages in HTML format, sometimes without including a text-only copy of the same message. Some agents include the HTML as part of a multipart [[wp>MIME]] message, correctly marked as text/html. Other agents compose the message body directly in HTML, without using the MIME multipart system.
  
-The Anomy Sanitizer uses several methods to detect the HTML parts into a message, relaying on the **Content-Type: text/html** or the **filename** of the MIME part (if specified). Once it detects an HTML part, it performs some operations on it, one of them is the match with a **regular expression** to confirm that it is actually an HTML text. If that regex test fails, the Sanitizer neutralize such part changing its content type from **text/html** to something like **application/ANTIVIRUS- 14789** (the type name is composed using the **msg_defanged** configuration option).
+In some circumstances Sanitizer defangs the HTML message or the HTML part (by changing its content type), so a modern email reader does not display it correctly. In the best case an **anonymous attachment** is shown; in the worst case **an empty message** is shown.
 + 
 +The Anomy Sanitizer uses several methods to detect the HTML parts of a message, relying on the **Content-Type: text/html** header or the **filename** of the MIME part (if specified). Once it detects an HTML part, it performs some operations on it; one of them is a match against a **regular expression** to confirm that the part actually contains HTML text. If that regex test fails, Sanitizer neutralizes (defangs) the part by changing its content type from **text/html** to something like **application/DEFANGED-14789** (the type name is composed using the **msg_defanged** configuration option).
 + 
 +That behaviour is triggered by the **feat_files = 1** configuration option (enable filename-based policy decisions). 
 + 
 +Unfortunately the regex used by Sanitizer to confirm an HTML part is very naive: the content simply has to match this expression:
 + 
 +<file> 
 +<html|<body|<p>|<b>|<i>|<br>|</a> 
 +</file> 
 + 
 +Notably, as of Jan 2023 the **Gmail** application composes mail messages using only **%%<div>%%** tags, so the expression does not match and Sanitizer ends up //defanging// that part.
 + 
 +I fixed the Perl code in **/usr/share/perl5/Anomy/Sanitizer/FileTypes.pm**, changing the regular expression in this way:
 + 
 +<code perl> 
 +my $HTML = { 
 +    id         => "html", 
 +    risk       => $low, 
 +    name       => "HTML text file", 
 +    extensions => [ "html", "htm", "shtml" ], 
 +    mime_types => [ 'text/html' ], 
 +    magic      => [ ], 
 +    regexp     => '<html|<body|<div|<span|<p>|<b>|<i>|<br>|</a>', 
 +}; 
 +</code> 
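
The difference between the two expressions can be seen with a quick test outside Sanitizer. This is only a sketch: the sample body is hypothetical, ''looks_like_html'' is a helper invented for this example, and the case-insensitive, unanchored match is an assumption about how Sanitizer applies the ''regexp'' element, not something taken from its code.

<code perl>
use strict;
use warnings;

# Expression shipped with the package and the patched one, copied from above.
my $orig_re    = '<html|<body|<p>|<b>|<i>|<br>|</a>';
my $patched_re = '<html|<body|<div|<span|<p>|<b>|<i>|<br>|</a>';

# Hypothetical body resembling a message composed with only <div> tags.
my $body = '<div dir="ltr">Hello, see you tomorrow.</div>';

# Hypothetical helper. Assumption: case-insensitive, unanchored match;
# the real test inside Sanitizer may be applied differently.
sub looks_like_html {
    my ($text, $re) = @_;
    return $text =~ /$re/i ? 1 : 0;
}

printf "original: %d\n", looks_like_html($body, $orig_re);
printf "patched:  %d\n", looks_like_html($body, $patched_re);
</code>

For this sample body the original expression finds nothing to match, while the patched one matches on ''<div''.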
 + 
 +It is also possible to remove the ''regexp'' element of the hash; in that case Sanitizer will recognize an HTML part only by the content type or the filename.
  
-That behaviour is triggered by the **feat_files = 1** configuration option.
+The customized Perl module can be installed as **/etc/perl/Anomy/Sanitizer/FileTypes.pm**, without changing the file installed by the Debian package.
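
An override under /etc/perl only takes effect if that directory comes before /usr/share/perl5 in Perl's module search path (''@INC''), which is worth verifying on the actual system. The sketch below reports which copy of the module would be picked up first; ''first_in_path'' is a helper written for this example and is not part of Sanitizer.

<code perl>
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the first existing copy of a module file along
# a list of directories, mimicking how "require" walks @INC in order.
sub first_in_path {
    my ($rel_path, @dirs) = @_;
    for my $dir (@dirs) {
        next if ref $dir;    # skip @INC hooks (code refs), if any
        my $candidate = File::Spec->catfile($dir, $rel_path);
        return $candidate if -f $candidate;
    }
    return;
}

my $found = first_in_path('Anomy/Sanitizer/FileTypes.pm', @INC);
print defined $found ? "Would load: $found\n" : "Module not found in \@INC\n";
</code>

If the reported path is the one under /etc/perl, the customized copy is the one Sanitizer will load.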
  
doc/appunti/linux/sa/sanitizer.1674121649.txt.gz · Last modified: 2023/01/19 10:47 by niccolo