
Dots and Dashes: The Only Non-Alphabetical/Numerical Characters Allowed in Usernames on Social Media

Ampersands have different meanings in URLs and in the HTML that transforms the content from a website into the structure of boxes/lists/tables/links/forms that you see in your web browser. In HTML, quotation marks have a special significance when they are used outside of the page's plain text and inside links or forms.

Photo by Ilya Pavlov on Unsplash

When you're programming things like websites you often need to take data from one system and essentially turn that into part of the code for another system. That can get really messy though when that data could look just like the code you're mixing it into. 

What makes things especially difficult with websites is that early web browsers tried to be really forgiving about minor mistakes in your HTML code. Basically, you could write whatever and they'd try their very best to do what you probably meant. Twenty years later, we're building things like Reddit and Google Docs in our web browsers, yet trying not to break all the stuff people built 20 years ago or the snippets and practices people have carried around since.

For example, you could write a blog post like "Paste &copy your way to coding success!!1" for a website. The website keeps what you wrote, special characters and all, in a database from which it can get it later when people search by author/date/category, etc.

Now you create a page where people can see a list of all the entries, and you want the title to stand out, so you add some code like this, which separates the date from the title with a ":" and uses the HTML <strong> tag to make the title bold:

print("$date: <strong>$title</strong>")

There are two layers of code here. On the outside, there's the programming language, which you're using to connect the database and web browser (the "print" command, the parentheses and quotes used to contain what you want to print, and the placeholders beginning with "$" signs that let you put in bits from your database).

Inside that, you're printing HTML code to make the text look fancy when it reaches the web browser. But what if you wanted to surround the page title with quotation marks (")? What if you want to put micro$oft in there as well, without the programming language treating "$oft" as another placeholder?
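Here is a rough sketch of how one language answers both questions. Python is an assumption here (the snippet above looks more like PHP or Perl), but its string.Template class uses the same "$name" placeholders. A doubled "$$" stands for a literal dollar sign, and wrapping the string in single quotes lets double quotes appear inside it. The sample title and the "sponsored by" text are made up for illustration.

```python
from string import Template

# Outer layer: the programming language's own template syntax.
# Single quotes delimit the Python string, so the double quotes inside are literal.
# "$$" is Template's escape for one literal "$", so "$oft" is not read as a placeholder.
row = Template('$date: <strong>"$title"</strong> (sponsored by micro$$oft)')

# Inner layer: the HTML that the browser will eventually receive.
line = row.substitute(date="2018-12-06", title="Hello world")
print(line)
```

Other languages solve the same problem with different escape characters, which is part of why mixing layers gets confusing.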

Doesn't that sound perplexing and vexing? Wait a minute, it gets better. The words the user wrote in for a title may also contain HTML code, so that's a third layer to be concerned about.

Here is the HTML code that gets printed:

2018-12-06: <strong>Paste &copy your way to coding success!!1</strong>

Here's what you'd see on the website after that, with the title in bold: 2018-12-06: Paste © your way to coding success!!1

What is going on? It turns out "&copy;" is the HTML code for a copyright sign. And because browsers in the 1990s were built to be forgiving, "&copy" without the semicolon works most of the time too.
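You can watch this happen outside a browser. The sketch below assumes Python and uses the standard library's html.unescape, which decodes character references by HTML5 rules. It is only an approximation of what a browser does to text on a page, but it treats the semicolon-free "&copy" the same way.

```python
import html

title = "Paste &copy your way to coding success!!1"

# html.unescape accepts legacy references such as "&copy" even without
# the trailing semicolon, much like a forgiving browser would.
decoded = html.unescape(title)
print(decoded)
```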

OK, simple fix: just replace every "&" in someone's title with "&amp" before saving the post title to our database. That's the shortened but still functional HTML code for a literal "&".
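As a minimal sketch of that fix, assuming Python: escape_for_storage is a hypothetical helper name, and html.unescape stands in for the browser's decoding step.

```python
import html

def escape_for_storage(title: str) -> str:
    # Hypothetical helper: the naive fix, using the short "&amp" form
    # (no trailing semicolon).
    return title.replace("&", "&amp")

stored = escape_for_storage("Paste &copy your way to coding success!!1")
print(stored)

# Approximates what the browser shows: "&amp" decodes to a literal "&",
# and the "copy" that follows is left alone as plain text.
shown = html.unescape(stored)
print(shown)
```

Python's own html.escape does the same job, but it writes the full "&amp;" with the semicolon and also escapes "<", ">", and quotation marks.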

But wait, we already have a bunch of post titles saved without that, so when we generate the HTML from the database, let's check for any "&" symbols that aren't already part of "&amp;" and turn those into "&amp;" on the fly. Did you spot the bug? We save the short "&amp" but check for the full "&amp;" with its semicolon. Now the existing post comes out in HTML as

2018-12-06: <strong>Paste &amp;copy your way to coding success!!1</strong>

and renders right, but the same title posted again comes out as

2018-12-06: <strong>Paste &amp;ampcopy your way to coding success!!1</strong>

which shows up on the page as "Paste &ampcopy your way to coding success!!1".

This is a somewhat contrived/dramatized scenario, but it's very much what web engineers do all day, and you can see how the wires get crossed quite frequently.
