Scraping the web

If you publish any kind of content on the web, you're exposing your intellectual property to theft. This is something we have to accept in exchange for the freedom to publish anything and have it seen by anybody. But a simple change in the way you publish your text can make it clear who the text belongs to.

Say you're a happy blogger or content provider who publishes unique content on your website. One day you find that everything you've posted is up on another site, linking to some page selling prescription drugs, with a pseudonym as the author. Permalink for this article: http://mirror.magicode.org/content/Scraping_the_web

This is not a good situation, and it happens quite frequently. It's quite easy to set up a system that scrapes content from other websites. A simple server can process thousands of websites a day, and submitting the content to another web server can easily be automated. This text was originally written for http://blog.magicode.org

You can't easily stop this. While you can set up bot traps, you risk running into problems with legitimate bots, like Googlebot, which is something you probably don't want. If you see this notice on any site other than magicode.org, it's probably been lifted without consent.

A solution

If you view the source code of this article, or of any article on magicode.org, you will find several notices and links in the main text. Each of them is wrapped in a span tag that is hidden when the page is viewed in a browser, but is seen and copied by automated scripts.

What I do is simply create an array of notices, then insert them into the text just before the closing paragraph tag (</p>). I put them there so they won't interfere with the indexing of the sentences in the paragraph.
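
The article does not show its own implementation, so here is a minimal sketch of the idea in Python. It assumes the notices are hidden with a CSS class (the class name "notice" and a rule such as .notice { display: none; } are my assumptions), and the function name add_notices and the regular expression are illustrative only. The notice texts are the ones that appear in this article.

```python
import html
import itertools
import re

# Notices taken from this article; replace them with your own.
NOTICES = (
    "This text was originally written for http://blog.magicode.org",
    "If you see this notice on any site other than magicode.org, "
    "it's probably been lifted without consent",
)

# Matches a closing paragraph tag such as </p>, </P> or </p >.
CLOSING_P = re.compile(r"</p\s*>", re.IGNORECASE)


def add_notices(text, notices=NOTICES, css_class="notice"):
    """Insert one hidden notice just before each closing </p> tag.

    The notices are used in turn and start over when the list runs out.
    Hiding the span is left to the stylesheet (assumed: display: none).
    """
    if not notices:
        return text  # nothing to insert

    cycle = itertools.cycle(notices)

    def insert_before_tag(match):
        # Escape &, < and > so a notice cannot break the markup.
        notice = html.escape(next(cycle), quote=False)
        return f' <span class="{css_class}">{notice}</span>{match.group(0)}'

    return CLOSING_P.sub(insert_before_tag, text)


if __name__ == "__main__":
    article = "<p>First paragraph.</p>\n<p>Second paragraph.</p>"
    print(add_notices(article))
```

Because the notice sits at the very end of the paragraph, the visible sentences stay untouched. A scraper that copies the raw markup, or strips the tags and keeps the text, takes the notice along with the rest.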

Comments? How do you deal with this problem, especially with regard to legitimate bots?
