User:Dispenser/Checklinks
Original author(s) | Dispenser |
---|---|
Initial release | July 15, 2007 |
Operating system | Client: Any Web browser |
Platform | web pywikipedia |
Type | Link checker, Wikipedia Tool |
Website | Checklinks or direct link[note 1] or direct IP link |
Checklinks is a tool that checks external links for Wikimedia Foundation wikis. It parses a page, queries all external links, and classifies them using heuristics. The tool runs in two modes: an on-the-fly mode that gives instant results for individual pages, and a project scan mode that produces reports for interested WikiProjects.
The tool is typically used in one of two ways: as a link auditor during the article review process, to make sure the links are working; or as a link manager, where links can be reviewed, replaced with a working or archived link, given citation information, tagged, or removed.
As of September 2020, the Checklinks service is only usable via direct IP link.
Background
Link rot is a major problem for the English Wikipedia, more so than for other websites, since most external links are used to reference sources. Some of the dead links are caused by content being moved around without proper redirection, while others require micropayments after a certain time period, and others simply vanish. With perhaps a hundred links in an article, it becomes an ordeal to ensure that all the links and references are working correctly. Even featured articles that appear on our main page have had broken links. Some Wikipedians have built scripts to scan for dead links. There are giant aging lists, like those at Wikipedia:Link rot, which was last updated in late 2006. However, these scripts require manual checking to see if the link is still there, searching for a replacement, and finally editing the broken link. Much of this work is repetitive and represents an inefficient use of an editor's time. The Checklinks tool attempts to increase efficiency as much as possible by combining the most used features into a single interface.
Running
Type or paste a URL, a page title, or a wikilink into the input box. All major languages and projects are supported. MediaWiki search functionality is not supported in this interface at this time.
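How the three accepted input forms might be reduced to a page title can be sketched as follows. This is only an illustration, not the tool's code: the function name `page_title` and the assumption that article URLs use a `/wiki/` path are hypothetical choices.

```python
from urllib.parse import unquote, urlsplit


def page_title(user_input: str) -> str:
    """Reduce a URL, a wikilink, or a plain title to a page title (hypothetical helper)."""
    text = user_input.strip()
    if text.startswith("[[") and text.endswith("]]"):
        # Wikilink: drop the brackets and any "|label" part.
        text = text[2:-2].split("|", 1)[0]
    elif text.startswith(("http://", "https://")):
        # URL: assume the common /wiki/Title path layout.
        path = urlsplit(text).path
        if path.startswith("/wiki/"):
            text = unquote(path[len("/wiki/"):])
    # Underscores and spaces are interchangeable in page titles.
    return text.strip().replace("_", " ")
```

For instance, `page_title("[[Jimmy Wales|Jimbo]]")` and `page_title("https://en.wikipedia.org/wiki/Jimmy_Wales")` both reduce to the same title.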
Interface
Tools ▼ Save changes Jimmy Wales
Ref | External link | HTTP | Analysis |
---|---|---|---|
37 | Wikipedia Founder Edits Own Bio (info) [wired.com] accessdate=2006-02-14 publisher=Wired work=Wired News | 302 | Changes to date style path |
41 | In Search of an Online Utopia (info) | 302 | Changes domain and redirect to / |
- Page heading
  - Name of the article for the set of links below
  - Other tools which allow checking of contributions and basic page information
  - "Save changes" is used after setting actions using the drop-down
- Link
  - Reference number
  - The external link. May contain information extracted from {{cite web}} or {{citation}}.
  - HTTP status code; tooltip contains the reason as stated by the web server.
  - Analysis information. In this example it determined that one of the 302 redirects was likely a dead link.
Classifications
Identifier | Rank | Meaning | Action |
---|---|---|---|
Working (white) | 0 | The link appears to work. | No action is necessary. |
Message (green) | 1 | An HTTP move (redirect) has occurred. | The link should work but should be checked. If the server responded with HTTP 301, consider updating it. |
Warn (yellow) | 2 | The link could pose a problem to users. This includes expiring news sources, subscription-required pages, and pages with a low signal-to-noise ratio of links to text. | If the link is expiring, ensure that all critical details are filled in to allow someone to find an offline copy. |
Heuristically determined (orange) | 4 | The tool thinks that the link is dead, for example a 404 within a redirect chain or a redirect to the site root (/). | Check the link. If dead, attempt to complete the archiveurl field with an archived copy from the Internet Archive. Otherwise tag with {{dead link}}. |
Client error (red) | 5 | The server has confirmed the link as dead. | Ensure the link is correct and doesn't contain bits of wiki markup. If possible, use the archiveurl field to point to an archived copy from the Internet Archive. Otherwise tag with {{dead link}}. |
Server error or connection issue (blue) | 3 | A 5xx server error or a connection issue. | If a server error, contact the webmaster to fix the problem. If a connection issue, check to see if the Whois is still valid. |
Bad link (purple) | 6 | Spam link or Google cache link. | A parking link should be removed. A Google cache link should be converted back to a regular link or the archiveurl field should be used. |
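The ranking in the table could be approximated in code roughly as follows. This is a minimal sketch, not the tool's actual heuristics: the `Hop` tuple, the `classify` function, and the specific rules are assumptions read off the table, and the Warn and Bad link classes are left out because they depend on page content rather than on HTTP status.

```python
from typing import List, NamedTuple, Optional
from urllib.parse import urlsplit


class Hop(NamedTuple):
    """One step of a redirect chain (hypothetical structure)."""
    status: Optional[int]  # HTTP status code, or None if the connection failed
    url: str


# Ranks as listed in the table.
WORKING, MESSAGE, SERVER_ERROR, HEURISTIC, CLIENT_ERROR = 0, 1, 3, 4, 5


def classify(hops: List[Hop]) -> int:
    """Return a rank for a chain that starts at hops[0] and ends at hops[-1]."""
    last = hops[-1]
    redirected = len(hops) > 1
    if last.status is None or 500 <= last.status <= 599:
        return SERVER_ERROR
    if 400 <= last.status <= 499:
        # A 4xx reached through a redirect is only a guess; a direct 4xx is confirmed.
        return HEURISTIC if redirected else CLIENT_ERROR
    if redirected:
        start_path = urlsplit(hops[0].url).path
        end_path = urlsplit(last.url).path
        # A deep link that ends up at the site root has probably lost its content.
        if start_path not in ("", "/") and end_path in ("", "/"):
            return HEURISTIC
        return MESSAGE
    return WORKING
```

A real checker would need many more rules; the point here is only that the rank follows from the final status together with the shape of the redirect chain.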
Repair
Once the page has fully loaded, select an article to work on. Click on the link to make sure the tool has correctly identified the problem (errors can be reported on the talk page). If the link is incorrect, you can try a Google search to locate it again, right-click and copy the URL, and paste it into the prompt created by the "Input correct URL" or "Input archive URL" option. The color in the box on the left changes to indicate the type of replacement that will be performed on the URL. When you're finished, click "Save changes"; the tool will merge your changes and present a preview or a diff before letting you save.
Redirects
There are principally two types of redirects used:[note 2] HTTP 301 (permanent redirect) and HTTP 302 (temporary redirect). For the former, it is recommended that the linking page update the URL to use the new address. For the latter, updating is optional and should be reviewed by a human operator.
Some links are access redirects that avoid the need to log into a system. These may be treated as permalinks. Finally, there are redirects that point to fake or soft 404 pages. Do not blindly change these links![clarification needed]
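The distinction between the two redirect types can be illustrated with a small chain follower. This is a sketch under stated assumptions rather than the tool's code: `follow`, `advice`, and the injected `fetch` callable (which returns a status code and an optional Location value) are hypothetical names, and network access is deliberately left to the caller.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# Status codes that carry a Location header to follow.
REDIRECT_CODES = {301, 302, 303, 307, 308}
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def follow(url: str, fetch: Fetch, max_hops: int = 10) -> List[Tuple[int, str]]:
    """Collect (status, url) pairs until a non-redirect response or max_hops."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((status, url))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops


def advice(hops: List[Tuple[int, str]]) -> str:
    """Suggest what an editor might do with the original link."""
    redirects = [status for status, _ in hops[:-1]]
    if not redirects:
        return "no redirect"
    if all(status == 301 for status in redirects):
        return "consider updating the link"
    return "needs human review"
```

Even when the advice is to consider updating, the final page should still be opened by hand, since the chain may end at a soft 404.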
Do not "fix" redirects
- Doing so removes access to the archive history held by WebCite and the Wayback Machine under the old URL.
- WP:NOTBROKEN argues that the cost of an edit far exceeds the value of fixing a MediaWiki redirect. The same can be said about redirects on external links.
Archives
The Wayback Machine is a valuable tool for dead link repair. The simplest way to get the list of links from the Wayback Machine is to click on the row. You can also load the results manually and paste them in using the "Use archive URL" option. The software will attempt to insert the URL using the archiveurl parameter of {{cite web}}.
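What inserting an archive URL into a citation might look like is sketched below. This is not the tool's implementation: the helper names are hypothetical, the archivedate parameter is an assumption added alongside archiveurl, and the string handling only covers a template passed on its own that ends in `}}`.

```python
def wayback_url(url: str, timestamp: str) -> str:
    """Build a Wayback Machine snapshot URL from a YYYYMMDDhhmmss timestamp."""
    return "https://web.archive.org/web/" + timestamp + "/" + url


def add_archive(cite: str, archive_url: str, archive_date: str) -> str:
    """Append archive parameters to a single {{cite web}} call (hypothetical helper)."""
    stripped = cite.rstrip()
    # Leave the text alone if it is already archived or is not a closed template.
    if "archiveurl" in stripped or not stripped.endswith("}}"):
        return cite
    body = stripped[:-2].rstrip()
    return body + " |archiveurl=" + archive_url + " |archivedate=" + archive_date + "}}"
```

Nested templates or citations spread over several lines would need a real wikitext parser rather than this kind of string editing.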
Tips
- Most non-news links can be found again by doing a search with the title of the link. This is the default setup for searching.
- A link can be taken from the Google results by right-clicking, selecting "Copy Link Location", and entering it through the drop-down.
- Always check the link by clicking on it (not the row), as some websites do not like how tools send requests (false positive) and the tool cannot always cope with a site's incorrect error handling (false negative).
- Non-HTML documents can sometimes be found by searching for their file name.
- If Google turns up the same link, leave it be: it has recently or temporarily become dead, and you will not find a replacement until Google's index is updated.
- You may wish to email the webmaster asking them to use redirection to keep the old links working.
Internal workings
The tool downloads the wikitext using the edit page. It checks that the page exists and is not a redirect. Then it processes the markup: escaping certain comments so they are visible, removing nowiki'ed parts, expanding link templates, numbering bracketed links, adding reference numbers, and marking links tagged with {{dead link}}. Since templates are not actually expanded, some do not work as intended, most notably external link templates. A possible remedy is to use a better parser such as mwlib from Collection. The parsed page can be seen by appending &source=yes&debug=yes to the end of the URL.
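A few of the markup-processing steps described above (dropping nowiki'ed parts, numbering bracketed links, and noting {{dead link}} tags) can be sketched with regular expressions. This is an illustrative approximation, not the tool's parser: the patterns, the `extract_links` function, and the dictionary keys are all assumptions.

```python
import re
from typing import Dict, List

NOWIKI = re.compile(r"<nowiki>.*?</nowiki>", re.DOTALL | re.IGNORECASE)
# A bracketed external link: [http://example.org optional label]
BRACKETED = re.compile(r"\[(https?://[^\s\]]+)([^\]]*)\]")
# A {{dead link}} tag placed directly after the link.
DEAD_TAG = re.compile(r"\s*\{\{\s*dead link\b", re.IGNORECASE)


def extract_links(wikitext: str) -> List[Dict[str, object]]:
    """Return numbered bracketed links found outside <nowiki> sections."""
    text = NOWIKI.sub("", wikitext)
    links = []
    for number, match in enumerate(BRACKETED.finditer(text), start=1):
        links.append({
            "number": number,
            "url": match.group(1),
            "label": match.group(2).strip(),
            # True when a {{dead link}} tag starts right after the closing bracket.
            "tagged_dead": bool(DEAD_TAG.match(text, match.end())),
        })
    return links
```

Bare URLs, links inside citation templates, and HTML comments are ignored here, which is one reason a fuller parser such as mwlib is mentioned as a possible remedy.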
Limitations
- BBC.com has blocked the tool in the past, and the domain is now disabled to avoid waiting for connection timeouts.
- External links transcluded from templates are excluded. This is on purpose, as the tool would not be able to modify them when saving.
Linking
It is preferable to link using the tools: interwiki prefix. Change the link as follows:
[http://toolserver.org/~dispenser/view/Checklinks checklinks] becomes [[tools:~dispenser/view/Checklinks|checklinks]]
To link to a specific page, swap ?page= for /:
[http://toolserver.org/~dispenser/cgi-bin/webchecklinks.py?page=Edip_Yuksel Edip Yuksel links] becomes [[tools:~dispenser/cgi-bin/webchecklinks.py/Edip_Yuksel|Edip Yuksel links]]
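The same rewrite could be applied mechanically across a page. The sketch below is an assumption-laden illustration: the `to_interwiki` function and its regular expression are hypothetical, and it only handles bracketed links that carry a label.

```python
import re

# [http://toolserver.org/~user/path label] with the path and label captured.
TOOLSERVER_LINK = re.compile(r"\[http://toolserver\.org/(~[^\s\]]+)\s+([^\]]+)\]")


def to_interwiki(wikitext: str) -> str:
    """Rewrite labelled toolserver.org links to the tools: interwiki form."""
    def replace(match: "re.Match[str]") -> str:
        # Swap the first ?page= for / as described above.
        path = match.group(1).replace("?page=", "/", 1)
        return "[[tools:" + path + "|" + match.group(2).strip() + "]]"

    return TOOLSERVER_LINK.sub(replace, wikitext)
```

Links to other hosts and unlabelled links are left untouched.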
Praise
The da Vinci Barnstar
For the work you do on the link checker tool, which makes FAC so much easier. Thank you. Ealdgyth - Talk 14:53, 18 May 2008 (UTC)
Documentation TODO
- Add information on how a website can opt out of scanning.
- Break things up so they can be read non-linearly (i.e. use pictures, bullets).
- Explain why detection isn't 100% accurate. Give examples of websites that return 404 for pages that still have content, others that appear dead until the disks on the server finish spinning up, those that return 200 on error pages, etc.
- Users don't seem to understand that they can make edits WITH the tool or search the Internet Archive Wayback Machine and WebCite (archive too).
Notes
- [note 1] Temporarily; see User_talk:Dispenser#User:Dispenser.2FChecklinks. Old website, not working as of August 2017: http://dispenser.homenet.org/~dispenser/view/Checklinks
- [note 2] An HTTP redirect is not the same as a redirect used on Wikipedia.
See also
- /config: a template that sets up periodic link checking for a project
- Wikipedia:External links
- Wikipedia:Link rot
- Wikipedia:Wikipedia Signpost/2008-06-26/Dispatches
External links
- W3C checklink
- W3C Style: Cool URIs don't change
- Xenu's Link Sleuth – Windows program that can audit any website