[qutebrowser] Extend ad blocking to allow single URL blocking.

Mon Nov 5 10:17:50 CET 2018

Hey,

(re-adding the list to the Cc, I assume you accidentally didn't include
it in your reply)

On Sun, Nov 04, 2018 at 09:19:23PM +0200, Viacheslav Chimishuk wrote:
> > On Sun, Nov 04, 2018 at 07:16:45PM +0200, Viacheslav Chimishuk wrote:
> > > > > I propose to treat `~` prefix as a substring instead of domain
> > > > > name. For example, next file parses to substrings list.
> > > >
> > > > Again, use URL patterns. I will not accept any contributions inventing
> > > > yet another URL format, no matter how simple. Let's be consistent with
> > > > what the rest of qutebrowser uses. With URL patterns, I think they
> > > > should even be backwards-compatible with the "one host per line" format.
> > >
> > > I guess implementing Chrome's URL patter syntax here can be harder to
> > > implement. Ok, I'll try to implement it the way you proposed.
> >
> > It's as simple as "urlmatch.UrlPattern(pattern).matches(url)".
> 
> The only thing I'm not sure about is a simple loop with calling for
> the every pattern. In case of a thousands of patterns it should be
> very slow. I was thinking more about compiling patterns to some kind
> of tree, so match complexity for a single URL will be log(n) instead
> of n.
> Looks like you are fine if we can start from naive looping
> implementation first. Ain't you?

That's true, and a valid concern. Currently we just can test for
membership of a host in a set, which is O(1). Host blocking lookups are
done quite a lot, and there are over 50k blocked hosts with the default
blocklist, so having that O(n) instead of O(1) is probably a noticable
performance difference.

I added a benchmark test for adblock lookups:
https://github.com/qutebrowser/qutebrowser/commit/27d4796c2f9ced450ceaf3569de228532ce08df7

Currently looking up whether a host is blocked takes 3us (median).
When changing the lookup to:

  any(h == host for h in self._blocked_hosts)

instead of

  host in self._blocked_hosts

it instead takes 5.6ms, so almost 2000 times slower. Seeing that there
can be hundreds of adblock lookups for a page load, that's not
acceptable.

Like you say, the best way out would be having some kind of
UrlPatternSet class which can be more intelligent about lookups. I've
had some thoughts on that noted down somewhere already, but I can't find
them right now. Basically, you could hash patterns by their hosts, and
then if you want to check whether foobar.example.com is contained, make
some O(1) lookups for:

- foobar.example.com
- *.example.com
- *.com
- *

However, ensuring that this works efficiently for all imaginable cases
probably requires some more thought.

I'd instead propose the following:

- Add a UrlPatternSet class which we can extend like explained above
  over time.
- For now, just add a special case for patterns in the form of
  *://example.com/* (all schemes, constant host, all paths), by looking
  them up in a Python set. For all patterns which are more complex than
  that, fall back to iterating.

That solution should be quite straightforward for now, and result in a
huge speedup for the common case of adblocking lists. 

What do you think?

Florian

-- 
https://www.qutebrowser.org | me at the-compiler.org (Mail/XMPP)
   GPG: 916E B0C8 FD55 A072 | https://the-compiler.org/pubkey.asc
         I love long mails! | https://email.is-not-s.ms/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <https://listi.jpberlin.de/pipermail/qutebrowser/attachments/20181105/c71cd2e1/attachment.asc>