So, I do a lot of editing website data and HTML in Google Sheets. A lot of the time it’s a lot of old stuff that someone else populated, sometimes it’s been extracted from a vendor website, and it’s almost always in need of some sort of repair.
Recently I came across this beauty:
Which made me think about a way that I would first need to wipe all of the HTML. Usually, I would just do a Find and Replace for <br /> or <p> tags and knock them out one at a time, but as you can see from the image. This is a particularly special case, and I found this solution over on Stack Overflow that I’d like to share here for my own personal use and to hopefully help out anyone else who experiences this problem.
In the Stack Overflow post, they mention 2 ways of accomplishing this, but I’m only going to go over one of them (and modify it a bit):
=REGEXREPLACE(A1,"</?[^<>]*>","")
So looking at what Google says about this tag (REGEXREPLACE(text, regular_expression, replacement)), it’s pretty easy to see what we’re doing. Here, the text refers to the cell we’re searching for HTML in (here we used A1, but the cell in my example image would have been K14625), and replacement is telling it to be empty, but what’s going on in the regular_expression?
- </? = Finds all lesser than symbols to make them the starting point
- [^<>] = Negates the character class, causing it to match a single character not listed in the character class (greater than & lesser than).
- *> = Finds all greater than symbols to make that the stopping point (the asterisk helps it to ignore everything after the greater than)
If you go back into the original post, you’ll see that I removed the \S+ in the middle which I believe tells it to ignore whitespace, but that wasn’t relevant for what I was trying to accomplish, and my expression seemed to work just fine. If you’re experiencing problems, try it with the \S+ which can be found using the link above.