# Obsidian URL Name Extractor

**Forked from**: [obsidian-url-namer](https://github.com/zfei/obsidian-url-namer) by [zfei](https://github.com/zfei)    

This is a plugin for Obsidian (https://obsidian.md) that retrieves HTML titles to name raw URL links.

## What's New in This Fork

- **Improved URL detection**: Liberal regex pattern that correctly handles all valid URLs including domains with any valid TLD, DOI links, and academic article URLs
- **Progressive complexity**: Smart request strategy that tries simple approaches first, only adding complexity when needed to avoid triggering anti-bot systems
- **Configurable URL regex**: Customize the URL matching pattern in plugin settings
- **Site-specific title extraction**: Define custom title regex patterns for websites with non-standard HTML
- **Multiple fallback methods**: Archive.org and Microlink API support for bot-protected sites
- **Better error handling**: Clear error messages and graceful degradation

## Usage

Select the text that contains the URLs to be named, execute the command *Name the URL links in the selected text*.

It's recommended to name few URLs at a time. In the case when the URL requests are taking some time, please **DO NOT** change the text selection or the content itself, before the command is done. Otherwise, the eventual result will be out of order.

Easier with the command binded to a keyboard shortcut.

![demo](demo/url-namer-demo.gif)

## Settings

### URL Regex Pattern

Customize the regular expression used to detect URLs in your text. The default pattern is:

```regex
https?:\/\/[^\s\]\)]+
```

This pattern is compatible with all devices including iOS. It matches any `http://` or `https://` URL and automatically filters out URLs already in markdown links `[text](url)`. The plugin validates URLs using the native URL constructor, so invalid URLs are safely ignored.

### Site-Specific Title Patterns

For websites that don't use standard `<title>` tags or use lazy-loaded content, you can define custom patterns. Each pattern consists of:
- **URL Match**: A string to identify the website (e.g., `example.com`)
- **Title Regex**: A regex pattern to extract the title (e.g., `<meta property="og:title" content="([^"]*)"`)

Add one pattern per line in the format: `urlMatch|titleRegex`

**Example**:
```
arxiv.org|<meta name="citation_title" content="([^"]*)"
pubmed.ncbi.nlm.nih.gov|<meta name="citation_title" content="([^"]*)"
```

If no site-specific pattern matches, the plugin falls back to extracting from `<title>` tags.

## Known Limitations

### Bot Protection

Some websites use Cloudflare, AWS WAF, or other bot protection systems. The plugin handles these cases through its progressive approach:

1. Attempts simple request (avoids triggering protection)
2. Falls back to browser emulation if needed
3. Uses external services if bot protection is detected

**Common protection systems:**
- Cloudflare challenge pages
- AWS WAF cookie challenges
- JavaScript-based bot detection

**When you'll see errors:**
If bot protection is detected and no fallback methods are enabled:
```
Error: Bot protection detected. Enable a fallback method in settings.
```

**Recommended approach:**
Enable Microlink fallback in settings for the most reliable experience with protected sites.

### Archive.org Fallback

When enabled in settings, the plugin will automatically attempt to fetch titles from Archive.org's Wayback Machine if a site blocks direct access. This:
- Works for sites with Cloudflare or other bot protection
- May use slightly outdated content (shows date of archived snapshot)
- Adds a small delay while checking for archived versions
- Won't work for very recent URLs that haven't been archived yet

### Microlink Fallback

When enabled in settings, the plugin uses [Microlink API](https://microlink.io) to fetch titles from protected sites. This is the **recommended fallback** for protected sites.

**Features:**
- More reliable than Archive.org for recent content
- Works with most protected sites
- No account required for free tier

**Limitations:**
- ⚠️ **Free tier: 50 requests/day** — when exhausted, falls back to Archive.org if enabled
- URLs are sent to a third-party service (Microlink)
- Optional API key field for users with paid plans

### How Title Fetching Works

The plugin uses a progressive complexity approach to maximize compatibility:

1. **Simple request** — Clean HTTP request with minimal headers
   - Works for the majority of websites
   - Avoids triggering anti-bot systems
   - Faster response times
   - Obsidian's requestUrl automatically handles HTTP redirects (301, 302, etc.)

2. **Complex browser emulation** — Full browser-like headers if simple request fails
   - User-Agent, Accept, Referer, and other browser headers
   - Used automatically when simple approach doesn't work

3. **External fallback services** — When bot protection is detected:
   - **Microlink API** — Headless browser service (requires sending URLs to third-party)
   - **Archive.org** — Wayback Machine archived snapshots

Both fallback methods are **disabled by default**. Enable them in settings if you frequently encounter protected sites.

### HTML Entity Decoding

Page titles are automatically decoded for common HTML entities:
- Basic entities: `&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`, `&nbsp;`
- Typographic quotes: `&rsquo;`, `&lsquo;`, `&rdquo;`, `&ldquo;`
- Dashes: `&ndash;`, `&mdash;`
- Special characters: `&hellip;`, `&bull;`
- Numeric entities: `&#123;` (decimal) and `&#xAB;` (hexadecimal)

### Fallback Priority

When both Archive.org and Microlink fallbacks are enabled, you can choose the priority order:
- **Microlink → Archive.org** (recommended): More reliable for recent content
- **Archive.org → Microlink** (privacy-focused): Tries non-profit Archive.org first

## Troubleshooting

### URLs aren't being detected

Check your URL regex pattern in settings. The default pattern requires `http://` or `https://` prefix:
```regex
https?:\/\/[^\s\]\)]+
```

The plugin automatically skips URLs already in markdown links `[text](url)`.

### Microlink rate limit errors

The free tier of Microlink allows 50 requests per day. When this limit is reached:
1. Error message: "Microlink daily limit reached (50/day)"
2. Plugin automatically tries Archive.org fallback if enabled
3. Consider adding an API key in settings for higher limits

### Site-specific issues

For websites with non-standard HTML or lazy-loaded titles, configure site-specific patterns in settings. See the Site-Specific Title Patterns section above.

## Compilation

- Clone this repo.
- `npm i` or `yarn` to install dependencies
- `npm run build` to compile, or `npm run dev` to start compilation in watch mode.

## Installation

- After compiled, rename the `dist` directory to `obsidian-url-name-extractor` and move it into the vault's plugin directory `VaultFolder/.obsidian/plugins/`.

## Credits

Original plugin created by [zfei](https://github.com/zfei). This fork adds configurable settings, multiple fallback methods for bot-protected sites, and improved URL handling.