Bitmaps and negation

Just had another one of those moments where I really wasn’t thinking, then had to wonder why my code wasn’t working again. I had a very simple case where I needed a regex to match a character that was neither a word character nor a space character. My fingers quickly typed:

    [\W\S]

And I then wondered why it didn’t work. For those not following that part, I was basically creating a character class that consisted of \W (non-word chars) and \S (non-space chars). How is that different from what does work?

    [^\w\s]

After all, at a basic level, capitalizing the escape (\w vs. \W) is just a negation, just like the ^ is inside a char class.

It comes down to how the regex engine defines a character class. What it’s really doing inside is creating a bitmap for that token. When creating it, the \W is expanded into those bits that are not in the word set. ^\w, however, puts in the word chars and then says “not them”. With just one of these the two forms come out the same, but when I added the second set, it busted. Why?
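The bitmap idea can be sketched in a few lines of Python. This is only a toy model, not how any particular engine is implemented: it assumes plain ASCII (128 bits), and the helper names bitmap, is_word and is_space are made up for illustration.

```python
# Toy model: a character class is a 128-bit integer,
# where bit i is set when chr(i) belongs to the class (ASCII only).
ASCII = range(128)
ALL = (1 << 128) - 1  # every bit set, i.e. "any ASCII char"


def bitmap(pred):
    """Build a bitmap from a predicate on single characters."""
    bits = 0
    for code in ASCII:
        if pred(chr(code)):
            bits |= 1 << code
    return bits


def is_word(ch):
    # Assumed ASCII definition of \w: letters, digits, underscore.
    return ch.isalnum() or ch == "_"


def is_space(ch):
    # Assumed ASCII definition of \s.
    return ch in " \t\n\r\f\v"


WORD = bitmap(is_word)
SPACE = bitmap(is_space)

# [\W\S]: each escape is inverted on its own, then the results are OR'd.
union_of_negations = (ALL & ~WORD) | (ALL & ~SPACE)

# [^\w\s]: the sets are OR'd first, then the whole class is inverted once.
negated_union = ALL & ~(WORD | SPACE)


def contains(bits, ch):
    """True when ch's bit is set in the bitmap."""
    return bool(bits >> ord(ch) & 1)
```

In this model union_of_negations has every bit set, while negated_union keeps only punctuation and control chars, so contains(negated_union, "a") is false and contains(negated_union, "!") is true.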

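When the bitmap is created from all the non-word chars plus all the non-space chars, the two sets get OR’d together. Every word char is also a non-space char, and every space char is also a non-word char, so the combined set lets back in exactly the chars I wanted to keep out and ends up matching everything. With the ^ form, the word and space chars are combined first and the whole set is negated once, which is what I actually wanted.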
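Here is a quick way to see the difference for yourself. I’m using Python’s re module purely as a stand-in (it isn’t necessarily the engine I was using at the time), and the sample string is made up.

```python
import re

sample = "ab c_1!?"

# Union of "not a word char" and "not a space char": matches any char.
loose = re.compile(r"[\W\S]")

# Complement of "word char or space char": punctuation and the like only.
strict = re.compile(r"[^\w\s]")

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)

print(loose_hits)   # every character of the sample, one per match
print(strict_hits)  # ['!', '?']
```

The first pattern matches every character in the string, space and letters included; the second picks out only the two punctuation chars.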

Thankfully I caught the problem quickly, but I can see how somebody could get rather confused by it.