1 Commit

Author	SHA1	Message	Date
ale
b91d19dc0b fix memory remove dup
Signed-off-by: ale <ale@manalejandro.com>
2025-12-21 22:36:31 +01:00
18 files changed, 1159 additions and 1231 deletions

View file

@@ -1,17 +1,5 @@
-# Redis Configuration
-# Optional: Customize Redis connection settings
-# Redis host (default: localhost)
-REDIS_HOST=localhost
-# Redis port (default: 6379)
-REDIS_PORT=6379
-# Redis password (optional, required if Redis has authentication enabled)
-# REDIS_PASSWORD=your-secure-password
-# Redis database number (default: 0)
-# REDIS_DB=0
-# Node Environment
-NODE_ENV=development
+# Elasticsearch Configuration
+ELASTICSEARCH_NODE=http://localhost:9200
+# Optional: Set to 'development' or 'production'
+# NODE_ENV=development
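The hunk above collapses the connection settings into a single endpoint variable. A minimal sketch of how a client module might resolve it, assuming the localhost default documented in the new `.env.example` (the function name is illustrative, not taken from the repository):

```typescript
// Hypothetical helper: resolve the Elasticsearch endpoint from the
// environment, falling back to the default in .env.example.
function resolveNode(env: Record<string, string | undefined>): string {
  return env.ELASTICSEARCH_NODE ?? 'http://localhost:9200';
}

// Usage: resolveNode(process.env)
```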

34
API.md
View file

@@ -102,7 +102,7 @@ Content-Type: application/json
 }
 ```
-Note: When plaintext is provided, it is automatically stored in Redis for future lookups.
+Note: When plaintext is provided, it is automatically indexed in Elasticsearch for future lookups.
 #### Error Responses
@@ -113,7 +113,7 @@ Note: When plaintext is provided, it is automatically stored in Redis for future
 }
 ```
-**500 Internal Server Error** - Server or Redis error:
+**500 Internal Server Error** - Server or Elasticsearch error:
 ```json
 {
   "error": "Internal server error",
@@ -127,7 +127,7 @@ Note: When plaintext is provided, it is automatically stored in Redis for future
 **Endpoint**: `GET /api/health`
-**Description**: Check the health of the application and Redis connection.
+**Description**: Check the health of the application and Elasticsearch connection.
 #### Request
@@ -139,27 +139,31 @@ No parameters required.
 ```json
 {
   "status": "ok",
-  "redis": {
-    "version": "7.2.4",
-    "connected": true,
-    "memoryUsed": "1.5M"
+  "elasticsearch": {
+    "cluster": "elasticsearch",
+    "status": "green"
   },
-  "stats": {
-    "count": 1542,
-    "size": 524288
+  "index": {
+    "exists": true,
+    "name": "hasher",
+    "stats": {
+      "documentCount": 1542,
+      "indexSize": 524288
+    }
   }
 }
 ```
-**Redis connection status**:
-- `connected: true`: Redis is connected and responding
-- `connected: false`: Redis connection failed
+**Elasticsearch cluster status values**:
+- `green`: All primary and replica shards are active
+- `yellow`: All primary shards are active, but not all replicas
+- `red`: Some primary shards are not active
 **Error** (503 Service Unavailable):
 ```json
 {
   "status": "error",
-  "error": "Connection refused to Redis"
+  "error": "Connection refused to Elasticsearch"
 }
 ```
@@ -248,7 +252,7 @@ The API accepts requests from any origin by default. For production deployment,
 ## Notes
 - All timestamps are in ISO 8601 format
-- The API automatically creates Redis keys as needed
+- The API automatically creates the Elasticsearch index if it doesn't exist
 - Plaintext searches are automatically indexed for future lookups
 - Searches are case-insensitive
 - Hashes must be valid hexadecimal strings
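The API.md notes above say hashes must be valid hexadecimal strings and are matched case-insensitively. A hedged sketch of the kind of input validation this implies, inferring the hash type from hex length (MD5 32, SHA-1 40, SHA-256 64, SHA-512 128); the helper name is illustrative, not taken from the repository:

```typescript
// Map hex-string length to the hash type it could be.
const HEX_LENGTHS: Record<number, string> = {
  32: 'md5',
  40: 'sha1',
  64: 'sha256',
  128: 'sha512',
};

// Return the detected hash type, or null if the input is not a
// valid hexadecimal string of a known length.
function detectHashType(input: string): string | null {
  const normalized = input.trim().toLowerCase();
  if (!/^[0-9a-f]+$/.test(normalized)) return null; // not hexadecimal
  return HEX_LENGTHS[normalized.length] ?? null;
}
```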

View file

@@ -17,7 +17,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Copy to clipboard functionality for all hash values
 #### Backend
-- Redis integration with ioredis
+- Elasticsearch integration with configurable endpoint
 - Custom index mapping with 10 shards for horizontal scaling
 - Automatic index creation on first use
 - Auto-indexing of searched plaintext for future lookups
@@ -62,7 +62,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 #### Dependencies
 - Next.js 16.0.7
 - React 19.2.0
-- ioredis 5.4.2
+- Elasticsearch Client 8.x
 - Lucide React (icons)
 - Tailwind CSS 4.x
 - TypeScript 5.x
@@ -75,14 +75,14 @@ hasher/
 │   ├── layout.tsx          # Root layout
 │   └── page.tsx            # Main page
 ├── lib/                    # Utility libraries
-│   ├── redis.ts            # Redis client
+│   ├── elasticsearch.ts    # ES client
 │   └── hash.ts             # Hash utilities
 ├── scripts/                # CLI scripts
 │   └── index-file.ts       # Bulk indexer
 └── docs/                   # Documentation
 ```
-#### Redis Data Structure
+#### Elasticsearch Index Schema
 - Index name: `hasher`
 - Shards: 10
 - Replicas: 1
@@ -91,9 +91,7 @@ hasher/
 ### Configuration
 #### Environment Variables
-- `REDIS_HOST`: Redis server host (default: localhost)
-- `REDIS_PORT`: Redis server port (default: 6379)
-- `REDIS_PASSWORD`: Redis authentication password (optional)
+- `ELASTICSEARCH_NODE`: Elasticsearch endpoint (default: http://localhost:9200)
 #### Performance
 - Bulk indexing: 1000-5000 docs/sec

View file

@@ -48,7 +48,7 @@ Thank you for considering contributing to Hasher! This document provides guideli
 Before submitting a PR:
 1. Test the web interface thoroughly
 2. Test the bulk indexing script
-3. Verify Redis integration
+3. Verify Elasticsearch integration
 4. Check for TypeScript errors: `npm run build`
 5. Run linter: `npm run lint`

View file

@@ -5,7 +5,7 @@ This guide covers deploying the Hasher application to production.
 ## Prerequisites
 - Node.js 18.x or higher
-- Redis 6.x or higher
+- Elasticsearch 8.x cluster
 - Domain name (optional, for custom domain)
 - SSL certificate (recommended for production)
@@ -34,15 +34,12 @@ Vercel provides seamless deployment for Next.js applications.
 4. **Set Environment Variables**:
    - Go to your project settings on Vercel
-   - Add environment variables:
-     - `REDIS_HOST=your-redis-host.com`
-     - `REDIS_PORT=6379`
-     - `REDIS_PASSWORD=your-secure-password` (if using authentication)
+   - Add environment variable: `ELASTICSEARCH_NODE=http://your-elasticsearch-host:9200`
    - Redeploy: `vercel --prod`
 #### Important Notes:
-- Ensure Redis is accessible from Vercel's servers
-- Consider using [Upstash](https://upstash.com) or [Redis Cloud](https://redis.com/try-free/) for managed Redis
+- Ensure Elasticsearch is accessible from Vercel's servers
+- Consider using Elastic Cloud or a publicly accessible Elasticsearch instance
 - Use environment variables for sensitive configuration
 ---
@@ -62,7 +59,7 @@ FROM base AS deps
 RUN apk add --no-cache libc6-compat
 WORKDIR /app
-COPY package.json package-lock.json* ./
+COPY package.json package-lock.json ./
 RUN npm ci
 # Rebuild the source code only when needed
@@ -71,13 +68,15 @@ WORKDIR /app
 COPY --from=deps /app/node_modules ./node_modules
 COPY . .
+ENV NEXT_TELEMETRY_DISABLED=1
 RUN npm run build
 # Production image, copy all the files and run next
 FROM base AS runner
 WORKDIR /app
-ENV NODE_ENV production
+ENV NODE_ENV=production
+ENV NEXT_TELEMETRY_DISABLED=1
 RUN addgroup --system --gid 1001 nodejs
 RUN adduser --system --uid 1001 nextjs
@@ -90,11 +89,24 @@ USER nextjs
 EXPOSE 3000
-ENV PORT 3000
+ENV PORT=3000
+ENV HOSTNAME="0.0.0.0"
 CMD ["node", "server.js"]
 ```
+#### Update next.config.ts:
+```typescript
+import type { NextConfig } from 'next';
+
+const nextConfig: NextConfig = {
+  output: 'standalone',
+};
+
+export default nextConfig;
+```
 #### Build and Run:
 ```bash
@@ -104,9 +116,7 @@ docker build -t hasher:latest .
 # Run the container
 docker run -d \
   -p 3000:3000 \
-  -e REDIS_HOST=redis \
-  -e REDIS_PORT=6379 \
-  -e REDIS_PASSWORD=your-password \
+  -e ELASTICSEARCH_NODE=http://elasticsearch:9200 \
   --name hasher \
   hasher:latest
 ```
@@ -124,24 +134,25 @@ services:
     ports:
       - "3000:3000"
     environment:
-      - REDIS_HOST=redis
-      - REDIS_PORT=6379
-      - REDIS_PASSWORD=your-secure-password
+      - ELASTICSEARCH_NODE=http://elasticsearch:9200
     depends_on:
-      - redis
+      - elasticsearch
     restart: unless-stopped
-  redis:
-    image: redis:7-alpine
-    command: redis-server --requirepass your-secure-password --appendonly yes
+  elasticsearch:
+    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
+    environment:
+      - discovery.type=single-node
+      - xpack.security.enabled=false
+      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
     ports:
-      - "6379:6379"
+      - "9200:9200"
     volumes:
-      - redis-data:/data
+      - elasticsearch-data:/usr/share/elasticsearch/data
     restart: unless-stopped
 volumes:
-  redis-data:
+  elasticsearch-data:
 ```
 Run with:
@@ -162,28 +173,13 @@ curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
 sudo apt-get install -y nodejs
 ```
-#### 2. Install Redis:
-```bash
-sudo apt-get update
-sudo apt-get install redis-server
-
-# Configure Redis
-sudo nano /etc/redis/redis.conf
-# Set: requirepass your-strong-password
-
-# Start Redis
-sudo systemctl start redis-server
-sudo systemctl enable redis-server
-```
-#### 3. Install PM2 (Process Manager):
+#### 2. Install PM2 (Process Manager):
 ```bash
 sudo npm install -g pm2
 ```
-#### 4. Clone and Build:
+#### 3. Clone and Build:
 ```bash
 cd /var/www
@@ -193,18 +189,16 @@ npm install
 npm run build
 ```
-#### 5. Configure Environment:
+#### 4. Configure Environment:
 ```bash
 cat > .env.local << EOF
-REDIS_HOST=localhost
-REDIS_PORT=6379
-REDIS_PASSWORD=your-strong-password
+ELASTICSEARCH_NODE=http://localhost:9200
 NODE_ENV=production
 EOF
 ```
-#### 6. Start with PM2:
+#### 5. Start with PM2:
 ```bash
 pm2 start npm --name "hasher" -- start
@@ -212,7 +206,7 @@ pm2 save
 pm2 startup
 ```
-#### 7. Configure Nginx (Optional):
+#### 6. Configure Nginx (Optional):
 ```nginx
 server {
@@ -239,62 +233,43 @@ sudo systemctl reload nginx
 ---
-## Redis Setup
+## Elasticsearch Setup
-### Option 1: Managed Redis (Recommended)
+### Option 1: Elastic Cloud (Managed)
-#### Upstash (Serverless Redis)
-1. Sign up at [Upstash](https://upstash.com)
-2. Create a database
-3. Copy connection details
-4. Update environment variables
-#### Redis Cloud
-1. Sign up at [Redis Cloud](https://redis.com/try-free/)
-2. Create a database
-3. Note the endpoint and password
-4. Update `REDIS_HOST`, `REDIS_PORT`, and `REDIS_PASSWORD`
-### Option 2: Self-Hosted Redis
+1. Sign up at [Elastic Cloud](https://cloud.elastic.co/)
+2. Create a deployment
+3. Note the endpoint URL
+4. Update `ELASTICSEARCH_NODE` environment variable
+### Option 2: Self-Hosted
 ```bash
 # Ubuntu/Debian
+wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
+sudo sh -c 'echo "deb https://artifacts.elastic.co/packages/8.x/apt stable main" > /etc/apt/sources.list.d/elastic-8.x.list'
 sudo apt-get update
-sudo apt-get install redis-server
+sudo apt-get install elasticsearch
-# Configure Redis security
-sudo nano /etc/redis/redis.conf
-# Important settings:
-# bind 127.0.0.1 ::1 # Only local connections (remove for remote)
-# requirepass your-strong-password
-# maxmemory 256mb
-# maxmemory-policy allkeys-lru
+# Configure
+sudo nano /etc/elasticsearch/elasticsearch.yml
+# Set: network.host: 0.0.0.0
-# Start Redis
-sudo systemctl start redis-server
-sudo systemctl enable redis-server
+# Start
+sudo systemctl start elasticsearch
+sudo systemctl enable elasticsearch
-# Test connection
-redis-cli -a your-strong-password ping
 ```
 ---
 ## Security Considerations
-### 1. Redis Security
+### 1. Elasticsearch Security
-- **Always** use a strong password with `requirepass`
-- Bind Redis to localhost if possible (`bind 127.0.0.1`)
-- Use TLS/SSL for remote connections (Redis 6+)
-- Disable dangerous commands:
-```
-rename-command FLUSHDB ""
-rename-command FLUSHALL ""
-rename-command CONFIG ""
-```
-- Set memory limits to prevent OOM
+- Enable authentication on Elasticsearch
+- Use HTTPS for Elasticsearch connection
+- Restrict network access with firewall rules
+- Update credentials regularly
 ### 2. Application Security
@@ -310,7 +285,7 @@ redis-cli -a your-strong-password ping
 # Example UFW firewall rules
 sudo ufw allow 80/tcp
 sudo ufw allow 443/tcp
-sudo ufw allow from YOUR_IP to any port 6379 # Redis (if remote)
+sudo ufw allow from YOUR_IP to any port 9200 # Elasticsearch
 sudo ufw enable
 ```
@@ -328,96 +303,37 @@ pm2 monit
 pm2 logs hasher
 ```
-### Redis Monitoring
+### Elasticsearch Monitoring
 ```bash
-# Test connection
-redis-cli ping
-
-# Get server info
-redis-cli INFO
-
-# Monitor commands
-redis-cli MONITOR
-
-# Check memory usage
-redis-cli INFO memory
-
-# Check stats
-redis-cli INFO stats
+# Health check
+curl http://localhost:9200/_cluster/health?pretty
+
+# Index stats
+curl http://localhost:9200/hasher/_stats?pretty
 ```
 ---
 ## Backup and Recovery
-### Redis Persistence
+### Elasticsearch Snapshots
-Redis offers two persistence options:
-#### RDB (Redis Database Backup)
-```bash
-# Configure in redis.conf
-save 900 1      # Save if 1 key changed in 15 minutes
-save 300 10     # Save if 10 keys changed in 5 minutes
-save 60 10000   # Save if 10000 keys changed in 1 minute
-
-# Manual snapshot
-redis-cli SAVE
-
-# Backup file location
-/var/lib/redis/dump.rdb
-```
-#### AOF (Append Only File)
-```bash
-# Enable in redis.conf
-appendonly yes
-appendfilename "appendonly.aof"
-
-# Sync options
-appendfsync everysec   # Good balance
-
-# Backup file location
-/var/lib/redis/appendonly.aof
-```
-### Backup Script
 ```bash
-#!/bin/bash
-# backup-redis.sh
-
-BACKUP_DIR="/backup/redis"
-DATE=$(date +%Y%m%d_%H%M%S)
-
-# Create backup directory
-mkdir -p $BACKUP_DIR
-
-# Trigger Redis save
-redis-cli -a your-password SAVE
-
-# Copy RDB file
-cp /var/lib/redis/dump.rdb $BACKUP_DIR/dump_$DATE.rdb
-
-# Keep only last 7 days
-find $BACKUP_DIR -name "dump_*.rdb" -mtime +7 -delete
-
-echo "Backup completed: dump_$DATE.rdb"
-```
-### Restore from Backup
-```bash
-# Stop Redis
-sudo systemctl stop redis-server
-
-# Replace dump file
-sudo cp /backup/redis/dump_YYYYMMDD_HHMMSS.rdb /var/lib/redis/dump.rdb
-sudo chown redis:redis /var/lib/redis/dump.rdb
-
-# Start Redis
-sudo systemctl start redis-server
+# Configure snapshot repository
+curl -X PUT "localhost:9200/_snapshot/hasher_backup" -H 'Content-Type: application/json' -d'
+{
+  "type": "fs",
+  "settings": {
+    "location": "/mnt/backups/elasticsearch"
+  }
+}'
+
+# Create snapshot
+curl -X PUT "localhost:9200/_snapshot/hasher_backup/snapshot_1?wait_for_completion=true"
+
+# Restore snapshot
+curl -X POST "localhost:9200/_snapshot/hasher_backup/snapshot_1/_restore"
 ```
 ---
@@ -427,24 +343,14 @@ sudo systemctl start redis-server
 ### Horizontal Scaling
 1. Deploy multiple Next.js instances
-2. Use a load balancer (nginx, HAProxy, Cloudflare)
+2. Use a load balancer (nginx, HAProxy)
-3. Share the same Redis instance
+3. Share the same Elasticsearch cluster
-### Redis Scaling Options
+### Elasticsearch Scaling
-#### 1. Redis Cluster
-- Automatic sharding across multiple nodes
-- High availability with automatic failover
-- Good for very large datasets
-#### 2. Redis Sentinel
-- High availability without sharding
-- Automatic failover
-- Monitoring and notifications
-#### 3. Read Replicas
-- Separate read and write operations
-- Scale read capacity
+1. Add more nodes to the cluster
+2. Increase shard count (already set to 10)
+3. Use replicas for read scaling
 ---
@@ -457,40 +363,28 @@ pm2 status
 pm2 logs hasher --lines 100
 ```
-### Check Redis
+### Check Elasticsearch
 ```bash
-# Test connection
-redis-cli ping
-
-# Check memory
-redis-cli INFO memory
-
-# Count keys
-redis-cli DBSIZE
-
-# Get stats
-redis-cli INFO stats
+curl http://localhost:9200/_cluster/health
+curl http://localhost:9200/hasher/_count
 ```
 ### Common Issues
-**Issue**: Cannot connect to Redis
-- Check if Redis is running: `sudo systemctl status redis-server`
-- Verify firewall rules
-- Check `REDIS_HOST` and `REDIS_PORT` environment variables
-- Verify password is correct
+**Issue**: Cannot connect to Elasticsearch
+- Check firewall rules
+- Verify Elasticsearch is running
+- Check `ELASTICSEARCH_NODE` environment variable
 **Issue**: Out of memory
 - Increase Node.js memory: `NODE_OPTIONS=--max-old-space-size=4096`
-- Configure Redis maxmemory
-- Set appropriate eviction policy
+- Increase Elasticsearch heap size
 **Issue**: Slow searches
-- Check Redis memory usage
-- Verify O(1) key lookups are being used
-- Monitor Redis with `redis-cli MONITOR`
-- Consider Redis Cluster for very large datasets
+- Add more Elasticsearch nodes
+- Optimize queries
+- Increase replica count
 ---
@@ -498,25 +392,9 @@ redis-cli INFO stats
 1. **Enable Next.js Static Optimization**
 2. **Use CDN for static assets**
-3. **Configure Redis pipelining** (already implemented)
-4. **Set appropriate maxmemory and eviction policy**
-5. **Use SSD storage for Redis persistence**
-6. **Enable connection pooling** (already implemented)
-7. **Monitor and optimize Redis memory usage**
+3. **Enable Elasticsearch caching**
+4. **Configure appropriate JVM heap for Elasticsearch**
+5. **Use SSD storage for Elasticsearch**
-
----
-
-## Environment Variables
-
-| Variable | Description | Default | Required |
-|----------|-------------|---------|----------|
-| `REDIS_HOST` | Redis server hostname | `localhost` | No |
-| `REDIS_PORT` | Redis server port | `6379` | No |
-| `REDIS_PASSWORD` | Redis authentication password | - | No* |
-| `NODE_ENV` | Node environment | `development` | No |
-| `PORT` | Application port | `3000` | No |
-
-*Required if Redis has authentication enabled
 ---
@@ -524,28 +402,5 @@ redis-cli INFO stats
 For deployment issues, check:
 - [Next.js Deployment Docs](https://nextjs.org/docs/deployment)
-- [Redis Documentation](https://redis.io/docs/)
+- [Elasticsearch Setup Guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html)
-- [Upstash Documentation](https://docs.upstash.com/)
 - Project GitHub Issues
-
----
-
-## Deployment Checklist
-
-Before going live:
-
-- [ ] Redis is secured with password
-- [ ] Environment variables are configured
-- [ ] SSL/TLS certificates are installed
-- [ ] Firewall rules are configured
-- [ ] Monitoring is set up
-- [ ] Backup strategy is in place
-- [ ] Load testing completed
-- [ ] Error logging configured
-- [ ] Redis persistence (RDB/AOF) configured
-- [ ] Rate limiting implemented (if needed)
-- [ ] Documentation is up to date
-
----
-
-**Ready to deploy! 🚀**

View file

@@ -2,7 +2,7 @@
 ## 📋 Project Overview
-**Hasher** is a modern, high-performance hash search and generation tool built with Next.js and powered by Redis. It provides a beautiful web interface for searching hash values and generating cryptographic hashes from plaintext.
+**Hasher** is a modern, high-performance hash search and generation tool built with Next.js and powered by Elasticsearch. It provides a beautiful web interface for searching hash values and generating cryptographic hashes from plaintext.
 ### Version: 1.0.0
 ### Status: ✅ Production Ready
@@ -25,7 +25,7 @@
 - Copy-to-clipboard functionality
 ### 📊 Backend
-- Redis integration with ioredis
+- Elasticsearch 8.x integration
 - 10-shard index for horizontal scaling
 - RESTful API with JSON responses
 - Automatic index creation and initialization
@@ -52,7 +52,7 @@
 ### Stack
 - **Frontend**: Next.js 16.0, React 19.2, Tailwind CSS 4.x
 - **Backend**: Next.js API Routes, Node.js 18+
-- **Database**: Redis 6.x+
+- **Database**: Elasticsearch 8.x
 - **Language**: TypeScript 5.x
 - **Icons**: Lucide React
@@ -68,7 +68,7 @@ hasher/
 │   └── globals.css         # Global styles
 ├── lib/
-│   ├── redis.ts            # Redis client & config
+│   ├── elasticsearch.ts    # ES client & config
 │   └── hash.ts             # Hash utilities
 ├── scripts/
@@ -106,7 +106,7 @@ Search for hashes or generate from plaintext
 - **Output**: Hash results or generated hashes
 ### GET /api/health
-Check system health and Redis status
+Check system health and Elasticsearch status
 - **Output**: System status and statistics
 ---
@@ -139,15 +139,13 @@ npm run index-file wordlist.txt -- --batch-size 500
 ### Environment Configuration
 ```bash
-# Optional: Set Redis connection
-export REDIS_HOST=localhost
-export REDIS_PORT=6379
-export REDIS_PASSWORD=your-password
+# Optional: Set Elasticsearch endpoint
+export ELASTICSEARCH_NODE=http://localhost:9200
 ```
 ---
-## 🗄️ Redis Data Structure
+## 🗄️ Elasticsearch Configuration
 ### Index: `hasher`
 - **Shards**: 10 (horizontal scaling)
@@ -222,9 +220,9 @@ export REDIS_PASSWORD=your-password
 ### Requirements
 - Node.js 18.x or higher
-- Redis 6.x or higher
+- Elasticsearch 8.x
 - 512MB RAM minimum
-- Redis server (local or remote)
+- Internet connection for Elasticsearch
 ---
@@ -287,7 +285,7 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
 ## 🙏 Acknowledgments
 - Built with [Next.js](https://nextjs.org/)
-- Powered by [Redis](https://redis.io/)
+- Powered by [Elasticsearch](https://www.elastic.co/)
 - Icons by [Lucide](https://lucide.dev/)
 - Styled with [Tailwind CSS](https://tailwindcss.com/)
@@ -315,7 +313,7 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
 ### Completed ✅
 - [x] Core hash search functionality
 - [x] Hash generation from plaintext
-- [x] Redis integration
+- [x] Elasticsearch integration
 - [x] Modern responsive UI
 - [x] Bulk indexing script
 - [x] API endpoints

View file

@@ -17,12 +17,6 @@ npm run index-file <file> -- --batch-size N # Custom batch size
 npm run index-file -- --help                # Show help
 ```
-### Duplicate Removal
-```bash
-npm run remove-duplicates -- --field md5 --dry-run   # Preview duplicates
-npm run remove-duplicates -- --field md5 --execute   # Remove duplicates
-```
 ## 🔍 Hash Detection Patterns
 | Type | Length | Example |
@@ -51,38 +45,32 @@ GET /api/health
 - **Web Interface**: http://localhost:3000
 - **Search API**: http://localhost:3000/api/search
 - **Health API**: http://localhost:3000/api/health
-- **Redis**: localhost:6379
+- **Elasticsearch**: http://localhost:9200
-## 📊 Redis Commands
+## 📊 Elasticsearch Commands
 ```bash
-# Test connection
-redis-cli ping
-
-# Get database stats
-redis-cli INFO stats
-
-# Count all keys
-redis-cli DBSIZE
-
-# List all hash documents
-redis-cli KEYS "hash:plaintext:*"
-
-# Get document
-redis-cli GET "hash:plaintext:password"
-
-# Get statistics
-redis-cli HGETALL hash:stats
-
-# Clear all data (CAUTION!)
-redis-cli FLUSHDB
+# Health
+curl http://localhost:9200/_cluster/health?pretty
+
+# Index stats
+curl http://localhost:9200/hasher/_stats?pretty
+
+# Document count
+curl http://localhost:9200/hasher/_count?pretty
+
+# Search
+curl http://localhost:9200/hasher/_search?pretty
+
+# Delete index (CAUTION!)
+curl -X DELETE http://localhost:9200/hasher
 ```
 ## 🐛 Troubleshooting
 | Problem | Solution |
 |---------|----------|
-| Can't connect to Redis | Check `REDIS_HOST` and `REDIS_PORT` env vars |
+| Can't connect to ES | Check `ELASTICSEARCH_NODE` env var |
 | Port 3000 in use | Use `PORT=3001 npm run dev` |
 | Module not found | Run `npm install` |
 | Build errors | Run `npm run build` to see details |
@@ -93,18 +81,17 @@ redis-cli FLUSHDB
 |------|---------|
 | `app/page.tsx` | Main UI component |
 | `app/api/search/route.ts` | Search endpoint |
-| `lib/redis.ts` | Redis configuration |
+| `lib/elasticsearch.ts` | ES configuration |
 | `lib/hash.ts` | Hash utilities |
 | `scripts/index-file.ts` | Bulk indexer |
-| `scripts/remove-duplicates.ts` | Duplicate remover |
 ## ⚙️ Environment Variables
 ```bash
+# Required
+ELASTICSEARCH_NODE=http://localhost:9200
+
 # Optional
-REDIS_HOST=localhost
-REDIS_PORT=6379
-REDIS_PASSWORD=your-password
 NODE_ENV=production
 ```
@@ -148,7 +135,6 @@ curl http://localhost:3000/api/health
 ```bash
 npm run index-file -- --help          # Indexer help
-npm run remove-duplicates -- --help   # Duplicate remover help
 ```
 ---

160
README.md
View file

@@ -1,9 +1,9 @@
 # Hasher 🔐
-A modern, high-performance hash search and generation tool powered by Redis and Next.js. Search for hash values to find their plaintext origins or generate hashes from any text input.
+A modern, high-performance hash search and generation tool powered by Elasticsearch and Next.js. Search for hash values to find their plaintext origins or generate hashes from any text input.
-![Hasher Banner](https://img.shields.io/badge/Next.js-15.4-black?style=for-the-badge&logo=next.js)
+![Hasher Banner](https://img.shields.io/badge/Next.js-16.0-black?style=for-the-badge&logo=next.js)
-![Redis](https://img.shields.io/badge/Redis-7.x-DC382D?style=for-the-badge&logo=redis)
+![Elasticsearch](https://img.shields.io/badge/Elasticsearch-8.x-005571?style=for-the-badge&logo=elasticsearch)
 ![TypeScript](https://img.shields.io/badge/TypeScript-5.x-3178C6?style=for-the-badge&logo=typescript)
 ## ✨ Features
@@ -11,7 +11,7 @@ A modern, high-performance hash search and generation tool powered by Redis and
 - 🔍 **Hash Lookup**: Search for MD5, SHA1, SHA256, and SHA512 hashes
 - 🔑 **Hash Generation**: Generate multiple hash types from plaintext
 - 💾 **Auto-Indexing**: Automatically stores searched plaintext and hashes
-- 📊 **Redis Backend**: Fast in-memory storage with persistence
+- 📊 **Elasticsearch Backend**: Scalable storage with 10 shards for performance
 - 🚀 **Bulk Indexing**: Import wordlists via command-line script
 - 🎨 **Modern UI**: Beautiful, responsive interface with real-time feedback
 - 📋 **Copy to Clipboard**: One-click copying of any hash value
@@ -32,8 +32,8 @@ A modern, high-performance hash search and generation tool powered by Redis and
 ┌─────────────┐
-│    Redis    │ ← In-memory storage
-│             │   with persistence
+│Elasticsearch│ ← Distributed storage
+│  10 Shards  │   (localhost:9200)
 └─────────────┘
 ```
@@ -42,7 +42,7 @@ A modern, high-performance hash search and generation tool powered by Redis and
 ### Prerequisites
 - Node.js 18.x or higher
-- Redis 7.x or higher
+- Elasticsearch 8.x running on `localhost:9200`
 - npm or yarn
 ### Installation
@@ -58,28 +58,20 @@ A modern, high-performance hash search and generation tool powered by Redis and
 npm install
 ```
-3. **Configure Redis** (optional)
+3. **Configure Elasticsearch** (optional)
-   By default, the app connects to `localhost:6379`. To change this:
+   By default, the app connects to `http://localhost:9200`. To change this:
 ```bash
-export REDIS_HOST=localhost
-export REDIS_PORT=6379
-export REDIS_PASSWORD=your_password # Optional
-export REDIS_DB=0 # Optional, defaults to 0
+export ELASTICSEARCH_NODE=http://your-elasticsearch-host:9200
 ```
-4. **Start Redis**
-```bash
-redis-server
-```
-5. **Run the development server**
+4. **Run the development server**
 ```bash
 npm run dev
 ```
-6. **Open your browser**
+5. **Open your browser**
 Navigate to [http://localhost:3000](http://localhost:3000)
@@ -108,9 +100,6 @@ npm run index-file wordlist.txt
 # With custom batch size
 npm run index-file wordlist.txt -- --batch-size 500
-# Resume from last position
-npm run index-file wordlist.txt -- --resume
 # Show help
 npm run index-file -- --help
 ```
@@ -128,23 +117,7 @@ qwerty
 - ✅ Progress indicator with percentage
 - ✅ Error handling and reporting
 - ✅ Performance metrics (docs/sec)
-- ✅ State persistence for resume capability
-- ✅ Duplicate detection
+- ✅ Automatic index refresh
-### Remove Duplicates Script
-Find and remove duplicate hash entries:
-```bash
-# Dry run (preview only)
-npm run remove-duplicates -- --dry-run --field md5
-
-# Execute removal
-npm run remove-duplicates -- --execute --field sha256
-
-# With custom batch size
-npm run remove-duplicates -- --execute --field md5 --batch-size 100
-```
 ## 🔌 API Reference
@@ -185,7 +158,6 @@ Search for a hash or generate hashes from plaintext.
   "found": true,
   "isPlaintext": true,
   "plaintext": "password",
-  "wasGenerated": false,
   "hashes": {
     "md5": "5f4dcc3b5aa765d61d8327deb882cf99",
     "sha1": "5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8",
@@ -199,60 +171,52 @@ Search for a hash or generate hashes from plaintext.
**GET** `/api/health` **GET** `/api/health`
Check Redis connection and database statistics. Check Elasticsearch connection and index status.
**Response**: **Response**:
```json ```json
{ {
"status": "ok", "status": "ok",
"redis": { "elasticsearch": {
"version": "7.2.4", "cluster": "elasticsearch",
"connected": true, "status": "green"
"memoryUsed": "1.5M",
"uptime": 3600
}, },
"database": { "index": {
"totalKeys": 1542, "exists": true,
"documentCount": 386, "name": "hasher",
"totalSize": 524288 "stats": {
"documentCount": 1542,
"indexSize": 524288
}
} }
} }
``` ```
## 🗄️ Redis Data Structure ## 🗄️ Elasticsearch Index
### Key Structures ### Index Configuration
The application uses the following Redis key patterns: - **Name**: `hasher`
- **Shards**: 10 (for horizontal scaling)
- **Replicas**: 1 (for redundancy)
1. **Hash Documents**: `hash:plaintext:{plaintext}` ### Mapping Schema
```json
{
"plaintext": "password",
"md5": "5f4dcc3b5aa765d61d8327deb882cf99",
"sha1": "5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8",
"sha256": "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8",
"sha512": "b109f3bbbc244eb82441917ed06d618b9008dd09b3befd1b5e07394c706a8bb980b1d7785e5976ec049b46df5f1326af5a2ea6d103fd07c95385ffab0cacbc86",
"created_at": "2024-01-01T00:00:00.000Z"
}
```
2. **Hash Indexes**: `hash:index:{algorithm}:{hash}` ```json
- Points to the plaintext value {
- One index per hash algorithm (md5, sha1, sha256, sha512) "plaintext": {
"type": "text",
3. **Statistics**: `hash:stats` (Redis Hash) "analyzer": "lowercase_analyzer",
- `count`: Total number of documents "fields": {
- `size`: Total data size in bytes "keyword": { "type": "keyword" }
}
### Data Flow },
"md5": { "type": "keyword" },
``` "sha1": { "type": "keyword" },
Plaintext → Generate Hashes → Store Document "sha256": { "type": "keyword" },
"sha512": { "type": "keyword" },
Create 4 Indexes (one per algorithm) "created_at": { "type": "date" }
}
Update Statistics
``` ```
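A document matching this mapping can be produced with Node's built-in `crypto` module; this is a sketch (field names follow the mapping above, `created_at` uses ISO-8601 for the `date` type):

```typescript
import { createHash } from 'node:crypto';

// Build a document shaped like the index mapping above.
function buildHashDocument(plaintext: string) {
  const digest = (alg: string) =>
    createHash(alg).update(plaintext).digest('hex');
  return {
    plaintext,
    md5: digest('md5'),
    sha1: digest('sha1'),
    sha256: digest('sha256'),
    sha512: digest('sha512'),
    created_at: new Date().toISOString(),
  };
}

const doc = buildHashDocument('password');
console.log(doc.md5); // 5f4dcc3b5aa765d61d8327deb882cf99
```

Because the hash fields are `keyword`, lookups use exact `term` queries; the `plaintext` field additionally gets a lowercase analyzer plus a `keyword` sub-field for exact matches.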
## 📁 Project Structure ## 📁 Project Structure
@@ -269,11 +233,10 @@ hasher/
│ ├── page.tsx # Main UI component │ ├── page.tsx # Main UI component
│ └── globals.css # Global styles │ └── globals.css # Global styles
├── lib/ ├── lib/
│ ├── redis.ts # Redis client & operations │ ├── elasticsearch.ts # ES client & index config
│ └── hash.ts # Hash utilities │ └── hash.ts # Hash utilities
├── scripts/ ├── scripts/
│   ├── index-file.ts       # Bulk indexing script │   └── index-file.ts       # Bulk indexing script
│ └── remove-duplicates.ts # Duplicate removal script
├── package.json ├── package.json
├── tsconfig.json ├── tsconfig.json
├── next.config.ts ├── next.config.ts
@@ -294,10 +257,7 @@ npm run start
Create a `.env.local` file: Create a `.env.local` file:
```env ```env
REDIS_HOST=localhost ELASTICSEARCH_NODE=http://localhost:9200
REDIS_PORT=6379
REDIS_PASSWORD=your_password # Optional
REDIS_DB=0 # Optional
``` ```
### Linting ### Linting
@@ -317,23 +277,10 @@ npm run lint
## 🚀 Performance ## 🚀 Performance
- **Bulk Indexing**: ~5000-15000 docs/sec (depending on hardware) - **Bulk Indexing**: ~1000-5000 docs/sec (depending on hardware)
- **Search Latency**: <5ms (typical) - **Search Latency**: <50ms (typical)
- **Memory Efficient**: In-memory storage with optional persistence - **Horizontal Scaling**: 10 shards for parallel processing
- **Atomic Operations**: Pipeline-based batch operations - **Auto-refresh**: Instant search availability for new documents
## 🔧 Redis Configuration
For optimal performance, consider these Redis settings:
```conf
# redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru
save 900 1
save 300 10
save 60 10000
```
## 🤝 Contributing ## 🤝 Contributing
@@ -352,7 +299,7 @@ This project is open source and available under the [MIT License](LICENSE).
## 🙏 Acknowledgments ## 🙏 Acknowledgments
- Built with [Next.js](https://nextjs.org/) - Built with [Next.js](https://nextjs.org/)
- Powered by [Redis](https://redis.io/) - Powered by [Elasticsearch](https://www.elastic.co/)
- Icons by [Lucide](https://lucide.dev/) - Icons by [Lucide](https://lucide.dev/)
- Styled with [Tailwind CSS](https://tailwindcss.com/) - Styled with [Tailwind CSS](https://tailwindcss.com/)
@@ -363,3 +310,4 @@ For issues, questions, or contributions, please open an issue on GitHub.
--- ---
**Made with ❤️ for the security and development community** **Made with ❤️ for the security and development community**
@@ -9,7 +9,7 @@ This guide will help you quickly set up and test the Hasher application.
Ensure you have: Ensure you have:
- ✅ Node.js 18.x or higher (`node --version`) - ✅ Node.js 18.x or higher (`node --version`)
- ✅ npm (`npm --version`) - ✅ npm (`npm --version`)
- ✅ Redis 7.x or higher running on `localhost:6379` - ✅ Elasticsearch running on `localhost:9200`
### 2. Installation ### 2. Installation
@@ -20,16 +20,13 @@ cd hasher
# Install dependencies # Install dependencies
npm install npm install
# Start Redis (if not running)
redis-server
# Start the development server # Start the development server
npm run dev npm run dev
``` ```
The application will be available at: **http://localhost:3000** The application will be available at: **http://localhost:3000**
### 3. Verify Redis Connection ### 3. Verify Elasticsearch Connection
```bash ```bash
# Check health endpoint # Check health endpoint
@@ -40,11 +37,7 @@ Expected response:
```json ```json
{ {
"status": "ok", "status": "ok",
"redis": { "elasticsearch": { ... }
"version": "7.2.4",
"connected": true,
"memoryUsed": "1.5M"
}
} }
``` ```
@@ -91,19 +84,22 @@ npm run index-file sample-wordlist.txt
**Expected Output**: **Expected Output**:
``` ```
📚 Hasher Indexer - Redis Edition 📚 Hasher Indexer
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Redis: localhost:6379 Elasticsearch: http://localhost:9200
Index: hasher
File: sample-wordlist.txt File: sample-wordlist.txt
Batch size: 100 Batch size: 100
🔗 Connecting to Redis... 🔗 Connecting to Elasticsearch...
✅ Connected successfully ✅ Connected successfully
📖 Reading file... 📖 Reading file...
✅ Found 20 words/phrases to process ✅ Found 20 words/phrases to process
⏳ Progress: 20/20 (100.0%) - Indexed: 20, Skipped: 0, Errors: 0 ⏳ Progress: 20/20 (100.0%) - Indexed: 20, Errors: 0
🔄 Refreshing index...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Indexing complete! ✅ Indexing complete!
@@ -118,16 +114,6 @@ After running the bulk indexer, search for:
All should return their plaintext values. All should return their plaintext values.
### Test 6: Remove Duplicates
```bash
# Dry run to preview duplicates
npm run remove-duplicates -- --dry-run --field md5
# Execute removal
npm run remove-duplicates -- --execute --field md5
```
--- ---
## 🔍 API Testing ## 🔍 API Testing
@@ -199,13 +185,13 @@ fetch('/api/search', {
- [ ] Results display correctly - [ ] Results display correctly
### Data Persistence ### Data Persistence
- [ ] New plaintext is saved to Redis - [ ] New plaintext is saved to Elasticsearch
- [ ] Saved hashes can be found in subsequent searches - [ ] Saved hashes can be found in subsequent searches
- [ ] Bulk indexing saves all entries - [ ] Bulk indexing saves all entries
- [ ] Duplicate detection works correctly - [ ] Index is created automatically if missing
### Error Handling ### Error Handling
- [ ] Redis connection errors are handled - [ ] Elasticsearch connection errors are handled
- [ ] Empty search queries are prevented - [ ] Empty search queries are prevented
- [ ] Invalid input is handled gracefully - [ ] Invalid input is handled gracefully
- [ ] Network errors show user-friendly messages - [ ] Network errors show user-friendly messages
@@ -214,20 +200,15 @@ fetch('/api/search', {
## 🐛 Common Issues & Solutions ## 🐛 Common Issues & Solutions
### Issue: Cannot connect to Redis ### Issue: Cannot connect to Elasticsearch
**Solution**: **Solution**:
```bash ```bash
# Check if Redis is running # Check if Elasticsearch is running
redis-cli ping curl http://localhost:9200
# Should respond: PONG
# If not running, start Redis # If not accessible, update the environment variable
redis-server export ELASTICSEARCH_NODE=http://your-elasticsearch-host:9200
# If using custom host/port, update environment variables
export REDIS_HOST=localhost
export REDIS_PORT=6379
npm run dev npm run dev
``` ```
@@ -261,48 +242,33 @@ npm run index-file -- "$(pwd)/sample-wordlist.txt"
--- ---
## 📊 Verify Data in Redis ## 📊 Verify Data in Elasticsearch
### Check Redis Connection ### Check Index Stats
```bash ```bash
redis-cli ping curl http://localhost:9200/hasher/_stats?pretty
``` ```
### Count Keys ### Count Documents
```bash ```bash
redis-cli DBSIZE curl http://localhost:9200/hasher/_count?pretty
``` ```
### View Sample Documents ### View Sample Documents
```bash ```bash
# List hash document keys curl http://localhost:9200/hasher/_search?pretty&size=5
redis-cli --scan --pattern "hash:plaintext:*" | head -5
# Get a specific document
redis-cli GET "hash:plaintext:password"
```
### Check Statistics
```bash
redis-cli HGETALL hash:stats
``` ```
### Search Specific Hash ### Search Specific Hash
```bash ```bash
# Find plaintext for an MD5 hash curl http://localhost:9200/hasher/_search?pretty -H 'Content-Type: application/json' -d'
redis-cli GET "hash:index:md5:5f4dcc3b5aa765d61d8327deb882cf99" {
"query": {
# Get the full document "term": {
redis-cli GET "hash:plaintext:password" "md5": "5f4dcc3b5aa765d61d8327deb882cf99"
``` }
}
### Monitor Redis Activity }'
```bash
# Watch commands in real-time
redis-cli MONITOR
# Check memory usage
redis-cli INFO memory
``` ```
--- ---
@@ -344,18 +310,9 @@ Create `search.json`:
``` ```
### Expected Performance ### Expected Performance
- Search latency: < 5ms - Search latency: < 100ms
- Bulk indexing: 5000-15000 docs/sec - Bulk indexing: 1000+ docs/sec
- Concurrent requests: 100+ - Concurrent requests: 50+
### Redis Performance Testing
```bash
# Benchmark Redis operations
redis-benchmark -t set,get -n 100000 -q
# Test with pipeline
redis-benchmark -t set,get -n 100000 -q -P 16
```
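The quoted rates depend heavily on hardware; the CPU-bound part (hash generation, excluding Elasticsearch I/O) can be sanity-checked locally with a sketch like this:

```typescript
import { createHash } from 'node:crypto';

// Generate the four digests one document needs.
function hashDoc(word: string) {
  return {
    md5: createHash('md5').update(word).digest('hex'),
    sha1: createHash('sha1').update(word).digest('hex'),
    sha256: createHash('sha256').update(word).digest('hex'),
    sha512: createHash('sha512').update(word).digest('hex'),
  };
}

// Time N documents and report docs/sec; real indexing throughput
// will be lower once network and bulk-request overhead are added.
const N = 10_000;
const start = process.hrtime.bigint();
for (let i = 0; i < N; i++) hashDoc(`word-${i}`);
const seconds = Number(process.hrtime.bigint() - start) / 1e9;
console.log(`${Math.round(N / seconds)} docs/sec (hash generation only)`);
```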
--- ---
@@ -372,13 +329,7 @@ redis-benchmark -t set,get -n 100000 -q -P 16
- [ ] CORS configuration - [ ] CORS configuration
- [ ] Rate limiting (if implemented) - [ ] Rate limiting (if implemented)
- [ ] Error message information disclosure - [ ] Error message information disclosure
- [ ] Redis authentication (if enabled) - [ ] Elasticsearch authentication (if enabled)
### Redis Security Checklist
- [ ] Redis password configured (REDIS_PASSWORD)
- [ ] Redis not exposed to internet
- [ ] Firewall rules configured
- [ ] TLS/SSL enabled (if needed)
--- ---
@@ -388,8 +339,7 @@ Before deploying to production:
- [ ] All tests passing - [ ] All tests passing
- [ ] Environment variables configured - [ ] Environment variables configured
- [ ] Redis secured with password - [ ] Elasticsearch secured and backed up
- [ ] Redis persistence configured (RDB/AOF)
- [ ] SSL/TLS certificates installed - [ ] SSL/TLS certificates installed
- [ ] Error logging configured - [ ] Error logging configured
- [ ] Monitoring set up - [ ] Monitoring set up
@@ -397,7 +347,6 @@ Before deploying to production:
- [ ] Security review done - [ ] Security review done
- [ ] Documentation reviewed - [ ] Documentation reviewed
- [ ] Backup strategy in place - [ ] Backup strategy in place
- [ ] Redis memory limits configured
--- ---
@@ -408,7 +357,7 @@ Before deploying to production:
## Environment ## Environment
- Node.js version: - Node.js version:
- Redis version: - Elasticsearch version:
- Browser(s) tested: - Browser(s) tested:
## Test Results ## Test Results
@@ -418,7 +367,6 @@ Before deploying to production:
- [ ] Hash search: PASS/FAIL - [ ] Hash search: PASS/FAIL
- [ ] Bulk indexing: PASS/FAIL - [ ] Bulk indexing: PASS/FAIL
- [ ] API endpoints: PASS/FAIL - [ ] API endpoints: PASS/FAIL
- [ ] Duplicate removal: PASS/FAIL
### Issues Found ### Issues Found
1. [Description] 1. [Description]
@@ -431,7 +379,6 @@ Before deploying to production:
- Average search time: - Average search time:
- Bulk index rate: - Bulk index rate:
- Concurrent users tested: - Concurrent users tested:
- Redis memory usage:
## Conclusion ## Conclusion
[Summary of testing] [Summary of testing]
@@ -447,8 +394,7 @@ After successful testing:
2. ✅ Fix any issues found 2. ✅ Fix any issues found
3. ✅ Perform load testing 3. ✅ Perform load testing
4. ✅ Review security 4. ✅ Review security
5. ✅ Configure Redis persistence 5. ✅ Prepare for deployment
6. ✅ Prepare for deployment
See [DEPLOYMENT.md](DEPLOYMENT.md) for deployment instructions. See [DEPLOYMENT.md](DEPLOYMENT.md) for deployment instructions.
@@ -1,24 +1,34 @@
import { NextResponse } from 'next/server'; import { NextResponse } from 'next/server';
import { getRedisInfo, getStats, INDEX_NAME } from '@/lib/redis'; import { esClient, INDEX_NAME } from '@/lib/elasticsearch';
export async function GET() { export async function GET() {
try { try {
// Check Redis connection and get info // Check Elasticsearch connection
const redisInfo = await getRedisInfo(); const health = await esClient.cluster.health({});
// Get stats // Check if index exists
const stats = await getStats(); const indexExists = await esClient.indices.exists({ index: INDEX_NAME });
// Get index stats if exists
let stats = null;
if (indexExists) {
const statsResponse = await esClient.indices.stats({ index: INDEX_NAME });
stats = {
documentCount: statsResponse._all?.primaries?.docs?.count || 0,
indexSize: statsResponse._all?.primaries?.store?.size_in_bytes || 0
};
}
return NextResponse.json({ return NextResponse.json({
status: 'ok', status: 'ok',
redis: { elasticsearch: {
version: redisInfo.version, cluster: health.cluster_name,
memory: redisInfo.memory, status: health.status,
dbSize: redisInfo.dbSize
}, },
stats: { index: {
count: stats.count, exists: indexExists,
size: stats.size name: INDEX_NAME,
stats
} }
}); });
} catch (error) { } catch (error) {

@@ -1,52 +1,152 @@
import { NextRequest, NextResponse } from 'next/server'; import { NextRequest, NextResponse } from 'next/server';
import { storeHashDocument, findByPlaintext, findByHash, initializeRedis } from '@/lib/redis'; import { esClient, INDEX_NAME, initializeIndex } from '@/lib/elasticsearch';
import { generateHashes, detectHashType } from '@/lib/hash'; import { generateHashes, detectHashType } from '@/lib/hash';
interface HashDocument {
plaintext: string;
md5: string;
sha1: string;
sha256: string;
sha512: string;
created_at?: string;
}
// Maximum allowed query length
const MAX_QUERY_LENGTH = 1000;
// Characters that could be used in NoSQL/Elasticsearch injection attacks
const DANGEROUS_PATTERNS = [
/[{}\[\]]/g, // JSON structure characters
/\$[a-zA-Z]/g, // MongoDB-style operators
/\\u[0-9a-fA-F]{4}/g, // Unicode escapes
/<script/gi, // XSS attempts
/javascript:/gi, // XSS attempts
];
/**
* Sanitize input to prevent NoSQL injection attacks
* For hash lookups, we only need alphanumeric characters and $
* For plaintext, we allow more characters but sanitize dangerous patterns
*/
function sanitizeInput(input: string): string {
// Trim and take first word only
let sanitized = input.trim().split(/\s+/)[0] || '';
// Limit length
if (sanitized.length > MAX_QUERY_LENGTH) {
sanitized = sanitized.substring(0, MAX_QUERY_LENGTH);
}
// Remove null bytes
sanitized = sanitized.replace(/\0/g, '');
// Check for dangerous patterns
for (const pattern of DANGEROUS_PATTERNS) {
sanitized = sanitized.replace(pattern, '');
}
return sanitized;
}
/**
* Validate that the input is safe for use in Elasticsearch queries
*/
function isValidInput(input: string): boolean {
// Check for empty input
if (!input || input.length === 0) {
return false;
}
// Check for excessively long input
if (input.length > MAX_QUERY_LENGTH) {
return false;
}
// Check for control characters (except normal whitespace)
if (/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/.test(input)) {
return false;
}
return true;
}
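The helpers above compose as: trim → take first token → cap length → strip null bytes and dangerous patterns. A condensed, runnable restatement of that pipeline (behavior may differ slightly from the route code):

```typescript
const MAX_QUERY_LENGTH = 1000;
// JSON structure chars, operator-style tokens, unicode escapes, XSS markers.
const DANGEROUS_PATTERNS = [
  /[{}\[\]]/g,
  /\$[a-zA-Z]/g,
  /\\u[0-9a-fA-F]{4}/g,
  /<script/gi,
  /javascript:/gi,
];

function sanitizeInput(input: string): string {
  // First whitespace-separated token, capped at the max length.
  let s = (input.trim().split(/\s+/)[0] || '').slice(0, MAX_QUERY_LENGTH);
  s = s.replace(/\0/g, ''); // drop null bytes
  for (const p of DANGEROUS_PATTERNS) s = s.replace(p, '');
  return s;
}

console.log(sanitizeInput('  password123  extra')); // password123
console.log(sanitizeInput('{"term":{"md5":"x"}}')); // "term":"md5":"x"
```

Note that sanitization is applied after validation rejects control characters and over-long input, so by the time `sanitizeInput` runs the query is already plausible.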
export async function POST(request: NextRequest) { export async function POST(request: NextRequest) {
try { try {
const { query } = await request.json(); const body = await request.json();
if (!query || typeof query !== 'string') { // Validate request body structure
if (!body || typeof body !== 'object') {
return NextResponse.json( return NextResponse.json(
{ error: 'Query parameter is required' }, { error: 'Invalid request body' },
{ status: 400 } { status: 400 }
); );
} }
// Ensure Redis is connected const { query } = body;
await initializeRedis();
const cleanQuery = query.trim().split(/\s+/)[0]; // Validate query type
if (!query || typeof query !== 'string') {
return NextResponse.json(
{ error: 'Query parameter is required and must be a string' },
{ status: 400 }
);
}
// Validate input before processing
if (!isValidInput(query)) {
return NextResponse.json(
{ error: 'Invalid query: contains forbidden characters or is too long' },
{ status: 400 }
);
}
// Sanitize input
const cleanQuery = sanitizeInput(query);
if (!cleanQuery) { if (!cleanQuery) {
return NextResponse.json( return NextResponse.json(
{ error: 'Invalid query: only whitespace provided' }, { error: 'Invalid query: only whitespace or invalid characters provided' },
{ status: 400 } { status: 400 }
); );
} }
// Ensure index exists
await initializeIndex();
const cleanQueryLower = cleanQuery.toLowerCase(); const cleanQueryLower = cleanQuery.toLowerCase();
const hashType = detectHashType(cleanQueryLower); const hashType = detectHashType(cleanQueryLower);
if (hashType) { if (hashType) {
// Query is a hash - search for it in Redis // Query is a hash - search for it in Elasticsearch
const doc = await findByHash(hashType, cleanQueryLower); const searchResponse = await esClient.search<HashDocument>({
index: INDEX_NAME,
query: {
term: {
[hashType]: cleanQueryLower
}
}
});
if (doc) { const hits = searchResponse.hits.hits;
if (hits.length > 0) {
// Found matching plaintext // Found matching plaintext
return NextResponse.json({ return NextResponse.json({
found: true, found: true,
hashType, hashType,
hash: cleanQuery, hash: cleanQuery,
results: [{ results: hits.map((hit) => {
plaintext: doc.plaintext, const source = hit._source!;
hashes: { return {
md5: doc.md5, plaintext: source.plaintext,
sha1: doc.sha1, hashes: {
sha256: doc.sha256, md5: source.md5,
sha512: doc.sha512, sha1: source.sha1,
} sha256: source.sha256,
}] sha512: source.sha512,
}
};
})
}); });
} else { } else {
// Hash not found in database // Hash not found in database
@@ -59,13 +159,20 @@ export async function POST(request: NextRequest) {
} }
} else { } else {
// Query is plaintext - check if it already exists first // Query is plaintext - check if it already exists first
const existingDoc = await findByPlaintext(cleanQuery); const existsResponse = await esClient.search<HashDocument>({
index: INDEX_NAME,
query: {
term: {
'plaintext.keyword': cleanQuery
}
}
});
let hashes; let hashes;
let wasGenerated = false;
if (existingDoc) { if (existsResponse.hits.hits.length > 0) {
// Plaintext found, retrieve existing hashes // Plaintext found, retrieve existing hashes
const existingDoc = existsResponse.hits.hits[0]._source!;
hashes = { hashes = {
md5: existingDoc.md5, md5: existingDoc.md5,
sha1: existingDoc.sha1, sha1: existingDoc.sha1,
@@ -73,22 +180,44 @@ export async function POST(request: NextRequest) {
sha512: existingDoc.sha512, sha512: existingDoc.sha512,
}; };
} else { } else {
// Plaintext not found, generate and store hashes // Plaintext not found, generate hashes and check if any hash already exists
hashes = await generateHashes(cleanQuery); hashes = generateHashes(cleanQuery);
await storeHashDocument({ const hashExistsResponse = await esClient.search<HashDocument>({
...hashes, index: INDEX_NAME,
created_at: new Date().toISOString() query: {
bool: {
should: [
{ term: { md5: hashes.md5 } },
{ term: { sha1: hashes.sha1 } },
{ term: { sha256: hashes.sha256 } },
{ term: { sha512: hashes.sha512 } },
],
minimum_should_match: 1
}
}
}); });
wasGenerated = true; if (hashExistsResponse.hits.hits.length === 0) {
// No duplicates found, insert new document
await esClient.index({
index: INDEX_NAME,
document: {
...hashes,
created_at: new Date().toISOString()
}
});
// Refresh index to make the document searchable immediately
await esClient.indices.refresh({ index: INDEX_NAME });
}
} }
return NextResponse.json({ return NextResponse.json({
found: true, found: true,
isPlaintext: true, isPlaintext: true,
plaintext: cleanQuery, plaintext: cleanQuery,
wasGenerated, wasGenerated: existsResponse.hits.hits.length === 0,
hashes: { hashes: {
md5: hashes.md5, md5: hashes.md5,
sha1: hashes.sha1, sha1: hashes.sha1,
@@ -14,8 +14,8 @@ const geistMono = Geist_Mono({
export const metadata: Metadata = { export const metadata: Metadata = {
title: "Hasher - Hash Search & Generator", title: "Hasher - Hash Search & Generator",
description: "Search for hashes or generate them from plaintext. Supports MD5, SHA1, SHA256, and SHA512. Powered by Redis.", description: "Search for hashes or generate them from plaintext. Supports MD5, SHA1, SHA256, and SHA512. Powered by Elasticsearch.",
keywords: ["hash", "md5", "sha1", "sha256", "sha512", "hash generator", "hash search", "redis"], keywords: ["hash", "md5", "sha1", "sha256", "sha512", "hash generator", "hash search", "elasticsearch"],
authors: [{ name: "Hasher" }], authors: [{ name: "Hasher" }],
creator: "Hasher", creator: "Hasher",
publisher: "Hasher", publisher: "Hasher",
@@ -359,7 +359,7 @@ function HasherContent() {
{/* Footer */} {/* Footer */}
<footer className="mt-16 text-center text-gray-500 text-sm"> <footer className="mt-16 text-center text-gray-500 text-sm">
<p>Powered by Redis • Built with Next.js</p> <p>Powered by Elasticsearch • Built with Next.js</p>
</footer> </footer>
</div> </div>
</div> </div>
lib/elasticsearch.ts (new file)
@@ -0,0 +1,76 @@
import { Client } from '@elastic/elasticsearch';
const ELASTICSEARCH_NODE = process.env.ELASTICSEARCH_NODE || 'http://localhost:9200';
const INDEX_NAME = 'hasher';
export const esClient = new Client({
node: ELASTICSEARCH_NODE,
requestTimeout: 30000,
maxRetries: 3,
});
export const INDEX_MAPPING = {
settings: {
number_of_shards: 10,
number_of_replicas: 1,
analysis: {
analyzer: {
lowercase_analyzer: {
type: 'custom' as const,
tokenizer: 'keyword',
filter: ['lowercase']
}
}
}
},
mappings: {
properties: {
plaintext: {
type: 'text' as const,
analyzer: 'lowercase_analyzer',
fields: {
keyword: {
type: 'keyword' as const
}
}
},
md5: {
type: 'keyword' as const
},
sha1: {
type: 'keyword' as const
},
sha256: {
type: 'keyword' as const
},
sha512: {
type: 'keyword' as const
},
created_at: {
type: 'date' as const
}
}
}
};
export async function initializeIndex(): Promise<void> {
try {
const indexExists = await esClient.indices.exists({ index: INDEX_NAME });
if (!indexExists) {
await esClient.indices.create({
index: INDEX_NAME,
settings: INDEX_MAPPING.settings,
mappings: INDEX_MAPPING.mappings
});
console.log(`Index '${INDEX_NAME}' created successfully with 10 shards`);
} else {
console.log(`Index '${INDEX_NAME}' already exists`);
}
} catch (error) {
console.error('Error initializing Elasticsearch index:', error);
throw error;
}
}
export { INDEX_NAME };
lib/redis.ts (deleted)
@@ -1,181 +0,0 @@
import Redis from 'ioredis';
const REDIS_HOST = process.env.REDIS_HOST || 'localhost';
const REDIS_PORT = parseInt(process.env.REDIS_PORT || '6379', 10);
const REDIS_PASSWORD = process.env.REDIS_PASSWORD || undefined;
const REDIS_DB = parseInt(process.env.REDIS_DB || '0', 10);
export const INDEX_NAME = 'hasher';
// Create Redis client with connection pooling
export const redisClient = new Redis({
host: REDIS_HOST,
port: REDIS_PORT,
password: REDIS_PASSWORD,
db: REDIS_DB,
retryStrategy: (times) => {
const delay = Math.min(times * 50, 2000);
return delay;
},
maxRetriesPerRequest: 3,
enableReadyCheck: true,
lazyConnect: false,
});
// Handle connection errors
redisClient.on('error', (err) => {
console.error('Redis Client Error:', err);
});
redisClient.on('connect', () => {
console.log('Redis connected successfully');
});
/**
* Redis Keys Structure:
*
* 1. Hash documents: hash:plaintext:{plaintext} = JSON string
* - Stores all hash data for a plaintext
*
* 2. Hash indexes: hash:index:{algorithm}:{hash} = plaintext
* - Allows reverse lookup from hash to plaintext
* - One key per algorithm (md5, sha1, sha256, sha512)
*
* 3. Statistics: hash:stats = Hash {count, size}
* - count: total number of unique plaintexts
* - size: approximate total size in bytes
*/
export interface HashDocument {
plaintext: string;
md5: string;
sha1: string;
sha256: string;
sha512: string;
created_at: string;
}
/**
* Store a hash document in Redis
*/
export async function storeHashDocument(doc: HashDocument): Promise<void> {
const pipeline = redisClient.pipeline();
// Store main document
const key = `hash:plaintext:${doc.plaintext}`;
pipeline.set(key, JSON.stringify(doc));
// Create indexes for each hash type
pipeline.set(`hash:index:md5:${doc.md5}`, doc.plaintext);
pipeline.set(`hash:index:sha1:${doc.sha1}`, doc.plaintext);
pipeline.set(`hash:index:sha256:${doc.sha256}`, doc.plaintext);
pipeline.set(`hash:index:sha512:${doc.sha512}`, doc.plaintext);
// Update statistics
pipeline.hincrby('hash:stats', 'count', 1);
pipeline.hincrby('hash:stats', 'size', JSON.stringify(doc).length);
await pipeline.exec();
}
/**
* Find a hash document by plaintext
*/
export async function findByPlaintext(plaintext: string): Promise<HashDocument | null> {
const key = `hash:plaintext:${plaintext}`;
const data = await redisClient.get(key);
if (!data) return null;
return JSON.parse(data) as HashDocument;
}
/**
* Find a hash document by any hash value
*/
export async function findByHash(algorithm: string, hash: string): Promise<HashDocument | null> {
const indexKey = `hash:index:${algorithm}:${hash}`;
const plaintext = await redisClient.get(indexKey);
if (!plaintext) return null;
return findByPlaintext(plaintext);
}
/**
* Check if a plaintext or any of its hashes exist
*/
export async function checkExistence(plaintext: string, hashes?: {
md5?: string;
sha1?: string;
sha256?: string;
sha512?: string;
}): Promise<boolean> {
// Check if plaintext exists
const plaintextKey = `hash:plaintext:${plaintext}`;
const exists = await redisClient.exists(plaintextKey);
if (exists) return true;
// Check if any hash exists
if (hashes) {
const pipeline = redisClient.pipeline();
if (hashes.md5) pipeline.exists(`hash:index:md5:${hashes.md5}`);
if (hashes.sha1) pipeline.exists(`hash:index:sha1:${hashes.sha1}`);
if (hashes.sha256) pipeline.exists(`hash:index:sha256:${hashes.sha256}`);
if (hashes.sha512) pipeline.exists(`hash:index:sha512:${hashes.sha512}`);
const results = await pipeline.exec();
if (results && results.some(([_err, result]) => result === 1)) {
return true;
}
}
return false;
}
/**
* Get database statistics
*/
export async function getStats(): Promise<{ count: number; size: number }> {
const stats = await redisClient.hgetall('hash:stats');
return {
count: parseInt(stats.count || '0', 10),
size: parseInt(stats.size || '0', 10),
};
}
/**
* Get Redis server info
*/
export async function getRedisInfo(): Promise<{
version: string;
memory: string;
dbSize: number;
}> {
const info = await redisClient.info('server');
const memory = await redisClient.info('memory');
const dbSize = await redisClient.dbsize();
const versionMatch = info.match(/redis_version:([^\r\n]+)/);
const memoryMatch = memory.match(/used_memory_human:([^\r\n]+)/);
return {
version: versionMatch ? versionMatch[1] : 'unknown',
memory: memoryMatch ? memoryMatch[1] : 'unknown',
dbSize,
};
}
/**
* Initialize Redis connection (just verify it's working)
*/
export async function initializeRedis(): Promise<void> {
try {
await redisClient.ping();
console.log('Redis connection verified');
} catch (error) {
console.error('Error connecting to Redis:', error);
throw error;
}
}
package.json
@@ -1,14 +1,14 @@
{ {
"name": "hasher", "name": "hasher",
"version": "1.0.0", "version": "1.0.0",
"description": "A modern hash search and generation tool powered by Redis and Next.js", "description": "A modern hash search and generation tool powered by Elasticsearch and Next.js",
"keywords": [ "keywords": [
"hash", "hash",
"md5", "md5",
"sha1", "sha1",
"sha256", "sha256",
"sha512", "sha512",
"redis", "elasticsearch",
"nextjs", "nextjs",
"cryptography", "cryptography",
"security", "security",
@@ -38,7 +38,7 @@
"remove-duplicates": "tsx scripts/remove-duplicates.ts" "remove-duplicates": "tsx scripts/remove-duplicates.ts"
}, },
"dependencies": { "dependencies": {
"ioredis": "^5.4.2", "@elastic/elasticsearch": "^9.2.0",
"lucide-react": "^0.555.0", "lucide-react": "^0.555.0",
"next": "15.4.8", "next": "15.4.8",
"react": "19.1.2", "react": "19.1.2",
scripts/index-file.ts
@@ -4,7 +4,7 @@
* Hasher Indexer Script * Hasher Indexer Script
* *
* This script reads a text file with one word/phrase per line and indexes * This script reads a text file with one word/phrase per line and indexes
* all the generated hashes into Redis. * all the generated hashes into Elasticsearch.
* *
* Usage: * Usage:
* npx tsx scripts/index-file.ts <path-to-file.txt> [options] * npx tsx scripts/index-file.ts <path-to-file.txt> [options]
@@ -19,16 +19,14 @@
* --help, -h Show this help message * --help, -h Show this help message
*/ */
import Redis from 'ioredis'; import { Client } from '@elastic/elasticsearch';
import { createReadStream, existsSync, readFileSync, writeFileSync, unlinkSync, openSync, readSync, closeSync } from 'fs'; import { createReadStream, existsSync, readFileSync, writeFileSync, unlinkSync } from 'fs';
import { resolve, basename } from 'path'; import { resolve, basename } from 'path';
import { createInterface } from 'readline'; import { createInterface } from 'readline';
import * as crypto from 'crypto'; import crypto from 'crypto';
const REDIS_HOST = process.env.REDIS_HOST || 'localhost'; const ELASTICSEARCH_NODE = process.env.ELASTICSEARCH_NODE || 'http://localhost:9200';
const REDIS_PORT = parseInt(process.env.REDIS_PORT || '6379', 10); const INDEX_NAME = 'hasher';
const REDIS_PASSWORD = process.env.REDIS_PASSWORD || undefined;
const REDIS_DB = parseInt(process.env.REDIS_DB || '0', 10);
const DEFAULT_BATCH_SIZE = 100; const DEFAULT_BATCH_SIZE = 100;
interface HashDocument { interface HashDocument {
@@ -89,12 +87,13 @@ function parseArgs(args: string[]): ParsedArgs {
         result.batchSize = parsed;
       }
     } else if (arg === '--batch-size') {
+      // Support --batch-size <value> format
       const nextArg = args[i + 1];
       if (nextArg && !nextArg.startsWith('-')) {
         const parsed = parseInt(nextArg, 10);
         if (!isNaN(parsed) && parsed > 0) {
           result.batchSize = parsed;
-          i++;
+          i++; // Skip next argument
         }
       }
     } else if (arg.startsWith('--state-file=')) {
@@ -106,6 +105,7 @@ function parseArgs(args: string[]): ParsedArgs {
       i++;
     }
   } else if (!arg.startsWith('-')) {
+    // Positional argument - treat as file path
     result.filePath = arg;
   }
 }
@@ -113,6 +113,49 @@ function parseArgs(args: string[]): ParsedArgs {
   return result;
 }
function getFileHash(filePath: string): string {
// Create a hash based on file path and size for quick identification
const stats = require('fs').statSync(filePath);
const hashInput = `${filePath}:${stats.size}:${stats.mtime.getTime()}`;
return crypto.createHash('md5').update(hashInput).digest('hex').substring(0, 8);
}
function getDefaultStateFile(filePath: string): string {
const fileName = basename(filePath).replace(/\.[^.]+$/, '');
return resolve(`.indexer-state-${fileName}.json`);
}
function loadState(stateFile: string): IndexerState | null {
try {
if (existsSync(stateFile)) {
const data = readFileSync(stateFile, 'utf-8');
return JSON.parse(data) as IndexerState;
}
} catch (error) {
console.warn(`⚠️ Could not load state file: ${error}`);
}
return null;
}
function saveState(stateFile: string, state: IndexerState): void {
try {
state.lastUpdate = new Date().toISOString();
writeFileSync(stateFile, JSON.stringify(state, null, 2), 'utf-8');
} catch (error) {
console.error(`❌ Could not save state file: ${error}`);
}
}
function deleteState(stateFile: string): void {
try {
if (existsSync(stateFile)) {
unlinkSync(stateFile);
}
} catch (error) {
console.warn(`⚠️ Could not delete state file: ${error}`);
}
}
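The state helpers above persist progress as pretty-printed JSON so an interrupted run can pick up later. A minimal, self-contained sketch of that save/load round trip (the `.demo-state.json` file name and the reduced `DemoState` shape are illustrative, not the script's actual state schema):

```typescript
import { writeFileSync, readFileSync, unlinkSync, existsSync } from 'fs';

interface DemoState {
  lastProcessedLine: number;
  indexed: number;
  lastUpdate: string;
}

// Write the state with a fresh timestamp, mirroring saveState above.
function save(path: string, state: DemoState): void {
  state.lastUpdate = new Date().toISOString();
  writeFileSync(path, JSON.stringify(state, null, 2), 'utf-8');
}

// Read it back, returning null when no state file exists yet.
function load(path: string): DemoState | null {
  if (!existsSync(path)) return null;
  return JSON.parse(readFileSync(path, 'utf-8')) as DemoState;
}

const demoPath = '.demo-state.json';
save(demoPath, { lastProcessedLine: 42, indexed: 40, lastUpdate: '' });
const restored = load(demoPath);
unlinkSync(demoPath); // clean up the demo file
```

JSON round-tripping keeps the state human-inspectable, which is why the script can tell you exactly which line it will resume from.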
 function generateHashes(plaintext: string): HashDocument {
   return {
     plaintext,
@@ -142,187 +185,74 @@ Options:
   --help, -h Show this help message

 Environment Variables:
-  REDIS_HOST Redis host (default: localhost)
-  REDIS_PORT Redis port (default: 6379)
-  REDIS_PASSWORD Redis password (optional)
-  REDIS_DB Redis database number (default: 0)
+  ELASTICSEARCH_NODE Elasticsearch node URL (default: http://localhost:9200)

 Examples:
-  # Index a file with default settings
-  npm run index-file -- wordlist.txt
-  # Index with custom batch size
-  npm run index-file -- wordlist.txt --batch-size=500
-  # Start fresh (ignore previous state)
-  npm run index-file -- wordlist.txt --no-resume
-  # Skip duplicate checking for speed
-  npm run index-file -- wordlist.txt --no-check
+  npx tsx scripts/index-file.ts wordlist.txt
+  npx tsx scripts/index-file.ts wordlist.txt --batch-size=500
+  npx tsx scripts/index-file.ts wordlist.txt --batch-size 500
+  npx tsx scripts/index-file.ts wordlist.txt --no-resume
+  npx tsx scripts/index-file.ts wordlist.txt --no-check
+  npm run index-file -- wordlist.txt --batch-size=500 --no-check

+State Management:
+  The script automatically saves progress to a state file. If interrupted,
+  it will resume from where it left off on the next run. Use --no-resume
+  to start fresh.

+Duplicate Checking:
+  By default, the script checks if each plaintext or hash already exists
+  in the index before inserting. Use --no-check to skip this verification
+  for faster indexing (useful when you're sure there are no duplicates).
 `);
-  process.exit(0);
 }
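The resume behavior described in the help text comes down to one decision: start from the saved line only when resuming is requested and the file's fingerprint still matches the saved state. A pure-function sketch of that rule (the `ResumeState` shape and `startLine` name are illustrative, not the script's API):

```typescript
interface ResumeState {
  fileHash: string;
  lastProcessedLine: number;
}

// Decide where to start reading: resume only when --no-resume was NOT given,
// a state file exists, and the file fingerprint still matches.
function startLine(state: ResumeState | null, currentHash: string, resume: boolean): number {
  if (!resume || state === null) return 0;
  return state.fileHash === currentHash ? state.lastProcessedLine : 0;
}
```

Keeping this as a pure function of (state, hash, flag) makes the otherwise stateful resume logic easy to reason about: a changed file can never silently resume mid-stream.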
-function computeFileHash(filePath: string): string {
-  // Use streaming for large files to avoid memory issues
-  const hash = crypto.createHash('sha256');
-  const input = createReadStream(filePath, { highWaterMark: 64 * 1024 }); // 64KB chunks
-  let buffer = Buffer.alloc(0);
-  const fd = openSync(filePath, 'r');
-  const chunkSize = 64 * 1024; // 64KB
-  const readBuffer = Buffer.alloc(chunkSize);
-  try {
-    let bytesRead;
-    do {
-      bytesRead = readSync(fd, readBuffer, 0, chunkSize, null);
-      if (bytesRead > 0) {
-        hash.update(readBuffer.subarray(0, bytesRead));
-      }
-    } while (bytesRead > 0);
-  } finally {
-    closeSync(fd);
-  }
-  return hash.digest('hex');
-}
+async function indexFile(filePath: string, batchSize: number, shouldResume: boolean, checkDuplicates: boolean, customStateFile: string | null) {
+  const client = new Client({ node: ELASTICSEARCH_NODE });
-function getStateFilePath(filePath: string, customPath: string | null): string {
-  if (customPath) {
-    return resolve(customPath);
-  }
-  const fileName = basename(filePath);
-  return resolve(`.indexer-state-${fileName}.json`);
-}
-function loadState(stateFilePath: string): IndexerState | null {
-  if (!existsSync(stateFilePath)) {
-    return null;
-  }
-  try {
-    const data = readFileSync(stateFilePath, 'utf-8');
-    return JSON.parse(data);
-  } catch (error) {
-    console.warn(`⚠️ Could not load state file: ${error}`);
-    return null;
-  }
-}
-function saveState(stateFilePath: string, state: IndexerState): void {
-  try {
-    writeFileSync(stateFilePath, JSON.stringify(state, null, 2), 'utf-8');
-  } catch (error) {
-    console.error(`❌ Could not save state file: ${error}`);
-  }
-}
-function deleteState(stateFilePath: string): void {
-  try {
-    if (existsSync(stateFilePath)) {
-      unlinkSync(stateFilePath);
-    }
-  } catch (error) {
-    console.warn(`⚠️ Could not delete state file: ${error}`);
-  }
-}
-async function countLines(filePath: string): Promise<number> {
-  return new Promise((resolve, reject) => {
-    let lineCount = 0;
-    const rl = createInterface({
-      input: createReadStream(filePath),
-      crlfDelay: Infinity
-    });
-    rl.on('line', () => lineCount++);
-    rl.on('close', () => resolve(lineCount));
-    rl.on('error', reject);
-  });
-}
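The countLines helper above (dropped in this change, since the new code reports progress by line number instead of a precomputed total) streams the file through a readline interface rather than reading it into memory. The same pattern, runnable end to end against a throwaway temp file:

```typescript
import { createReadStream, writeFileSync, unlinkSync } from 'fs';
import { createInterface } from 'readline';

// Count lines without loading the whole file into memory:
// readline emits one 'line' event per line of the underlying stream.
async function countLinesDemo(path: string): Promise<number> {
  return new Promise((resolve, reject) => {
    let n = 0;
    const rl = createInterface({ input: createReadStream(path), crlfDelay: Infinity });
    rl.on('line', () => n++);
    rl.on('close', () => resolve(n));
    rl.on('error', reject);
  });
}

const tmp = '.demo-wordlist.txt';
writeFileSync(tmp, 'alpha\nbeta\ngamma\n', 'utf-8');
const counted = countLinesDemo(tmp);
```

`crlfDelay: Infinity` makes `\r\n` count as a single line break, so Windows-style wordlists are counted the same as Unix ones.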
-async function main() {
-  const args = process.argv.slice(2);
-  const parsed = parseArgs(args);
-  if (parsed.showHelp || !parsed.filePath) {
-    showHelp();
-    process.exit(parsed.showHelp ? 0 : 1);
-  }
-  const filePath = parsed.filePath!;
-  const batchSize = parsed.batchSize;
-  const checkDuplicates = parsed.checkDuplicates;
   const absolutePath = resolve(filePath);
+  const stateFile = customStateFile || getDefaultStateFile(absolutePath);
+  const fileHash = getFileHash(absolutePath);
-  if (!existsSync(absolutePath)) {
-    console.error(`❌ File not found: ${absolutePath}`);
-    process.exit(1);
-  }
-  const stateFile = getStateFilePath(filePath, parsed.stateFile);
-  const fileHash = computeFileHash(absolutePath);
-  let state: IndexerState;
+  // State management
+  let state: IndexerState = {
+    filePath: absolutePath,
+    fileHash,
+    lastProcessedLine: 0,
+    totalLines: 0,
+    indexed: 0,
+    skipped: 0,
+    errors: 0,
+    startTime: Date.now(),
+    lastUpdate: new Date().toISOString()
+  };
+  // Check for existing state
+  const existingState = loadState(stateFile);
   let resumingFrom = 0;
-  if (parsed.resume) {
-    const loadedState = loadState(stateFile);
-    if (loadedState && loadedState.fileHash === fileHash) {
-      state = loadedState;
+  if (shouldResume && existingState) {
+    if (existingState.fileHash === fileHash) {
+      state = existingState;
       resumingFrom = state.lastProcessedLine;
-      console.log(`📂 Resuming from previous state: ${stateFile}`);
+      state.startTime = Date.now(); // Reset start time for this session
+      console.log(`📂 Found existing state, resuming from line ${resumingFrom}`);
     } else {
-      if (loadedState) {
-        console.log('⚠️ File has changed or state file is from a different file. Starting fresh.');
-      }
-      state = {
-        filePath: absolutePath,
-        fileHash,
-        lastProcessedLine: 0,
-        totalLines: 0,
-        indexed: 0,
-        skipped: 0,
-        errors: 0,
-        startTime: Date.now(),
-        lastUpdate: new Date().toISOString()
-      };
+      console.log(`⚠️ File has changed since last run, starting fresh`);
+      deleteState(stateFile);
     }
-  } else {
+  } else if (!shouldResume) {
     deleteState(stateFile);
-    state = {
-      filePath: absolutePath,
-      fileHash,
-      lastProcessedLine: 0,
-      totalLines: 0,
-      indexed: 0,
-      skipped: 0,
-      errors: 0,
-      startTime: Date.now(),
-      lastUpdate: new Date().toISOString()
-    };
   }
-  if (state.totalLines === 0) {
-    console.log('🔢 Counting lines...');
-    state.totalLines = await countLines(absolutePath);
-  }
-  const client = new Redis({
-    host: REDIS_HOST,
-    port: REDIS_PORT,
-    password: REDIS_PASSWORD,
-    db: REDIS_DB,
-  });
-  console.log('');
-  console.log('📚 Hasher Indexer');
-  console.log('━'.repeat(42));
-  console.log(`Redis: ${REDIS_HOST}:${REDIS_PORT}`);
+  console.log(`📚 Hasher Indexer`);
+  console.log(`━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━`);
+  console.log(`Elasticsearch: ${ELASTICSEARCH_NODE}`);
+  console.log(`Index: ${INDEX_NAME}`);
   console.log(`File: ${filePath}`);
   console.log(`Batch size: ${batchSize}`);
-  console.log(`Duplicate check: ${checkDuplicates ? 'enabled' : 'disabled (--no-check)'}`);
+  console.log(`Check duplicates: ${checkDuplicates ? 'yes' : 'no (--no-check)'}`);
+  console.log(`State file: ${stateFile}`);
   if (resumingFrom > 0) {
     console.log(`Resuming from: line ${resumingFrom}`);
     console.log(`Already indexed: ${state.indexed}`);
@@ -330,6 +260,7 @@ async function main() {
   }
   console.log('');
+  // Handle interruption signals
   let isInterrupted = false;
   const handleInterrupt = () => {
     if (isInterrupted) {
@@ -341,6 +272,7 @@ async function main() {
     saveState(stateFile, state);
     console.log(`💾 State saved to ${stateFile}`);
     console.log(`   Resume with: npx tsx scripts/index-file.ts ${filePath}`);
+    console.log(`   Or start fresh with: npx tsx scripts/index-file.ts ${filePath} --no-resume`);
     process.exit(0);
   };
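The interrupt handler above implements a common two-stage pattern: the first SIGINT/SIGTERM requests a graceful stop so the current batch can finish and state can be saved, while a second signal forces an immediate exit. A testable sketch of that control flow (the factory name and the `onForceExit` callback are a test seam, not the script's API):

```typescript
// First call: mark as interrupted and report "graceful" (false).
// Second call: invoke the force-exit callback and report "forced" (true).
function makeInterruptHandler(onForceExit: () => void) {
  let interrupted = false;
  return function handle(): boolean {
    if (interrupted) {
      onForceExit(); // second Ctrl+C: give up immediately
      return true;
    }
    interrupted = true; // first Ctrl+C: finish current batch, save state
    return false;
  };
}

let forced = 0;
const handle = makeInterruptHandler(() => { forced++; });
const first = handle();  // graceful request
const second = handle(); // forced exit
```

In the real script the handler is registered with `process.on('SIGINT', ...)` and `process.on('SIGTERM', ...)`, and removed again in a `finally` block so a long-lived parent process is not left with stale listeners.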
@@ -348,11 +280,13 @@ async function main() {
   process.on('SIGTERM', handleInterrupt);
   try {
-    console.log('🔗 Connecting to Redis...');
-    await client.ping();
+    // Test connection
+    console.log('🔗 Connecting to Elasticsearch...');
+    await client.cluster.health({});
     console.log('✅ Connected successfully\n');
-    console.log('📖 Reading file...\n');
+    // Process file line by line using streams
+    console.log('📖 Processing file...\n');
     let currentLineNumber = 0;
     let currentBatch: string[] = [];
@@ -367,128 +301,219 @@ async function main() {
     crlfDelay: Infinity
   });
-  const processBatch = async (batch: string[]) => {
-    if (batch.length === 0 || isInterrupted) return;
+  const processBatch = async (batch: string[], lineNumber: number) => {
+    if (batch.length === 0) return;
+    if (isInterrupted) return;
-    const batchWithHashes = batch.map(plaintext => generateHashes(plaintext));
-    let toIndex = batchWithHashes;
+    const bulkOperations: any[] = [];
+    // Generate hashes for all items in batch first
+    const batchWithHashes = batch.map((plaintext: string) => ({
+      plaintext,
+      hashes: generateHashes(plaintext)
+    }));
     if (checkDuplicates) {
-      const existenceChecks = await Promise.all(
-        batchWithHashes.map(doc => client.exists(`hash:plaintext:${doc.plaintext}`))
-      );
-      const newDocs = batchWithHashes.filter((_doc, idx) => existenceChecks[idx] === 0);
-      const existingCount = batchWithHashes.length - newDocs.length;
-      state.skipped += existingCount;
-      sessionSkipped += existingCount;
-      toIndex = newDocs;
-    }
+      // Check which items already exist (by plaintext or any hash)
+      const md5List = batchWithHashes.map((item: any) => item.hashes.md5);
+      const sha1List = batchWithHashes.map((item: any) => item.hashes.sha1);
+      const sha256List = batchWithHashes.map((item: any) => item.hashes.sha256);
+      const sha512List = batchWithHashes.map((item: any) => item.hashes.sha512);
+      const existingCheck = await client.search({
+        index: INDEX_NAME,
+        size: batchSize * 5,
+        query: {
+          bool: {
+            should: [
+              { terms: { 'plaintext.keyword': batch } },
+              { terms: { md5: md5List } },
+              { terms: { sha1: sha1List } },
+              { terms: { sha256: sha256List } },
+              { terms: { sha512: sha512List } },
+            ],
+            minimum_should_match: 1
+          }
+        },
+        _source: ['plaintext', 'md5', 'sha1', 'sha256', 'sha512']
+      });
+      // Create a set of existing hashes for quick lookup
+      const existingHashes = new Set<string>();
+      existingCheck.hits.hits.forEach((hit: any) => {
+        const src = hit._source;
+        existingHashes.add(src.plaintext);
+        existingHashes.add(src.md5);
+        existingHashes.add(src.sha1);
+        existingHashes.add(src.sha256);
+        existingHashes.add(src.sha512);
+      });
-    if (toIndex.length > 0) {
-      const pipeline = client.pipeline();
-      for (const doc of toIndex) {
-        const key = `hash:plaintext:${doc.plaintext}`;
-        pipeline.set(key, JSON.stringify(doc));
-        pipeline.set(`hash:index:md5:${doc.md5}`, doc.plaintext);
-        pipeline.set(`hash:index:sha1:${doc.sha1}`, doc.plaintext);
-        pipeline.set(`hash:index:sha256:${doc.sha256}`, doc.plaintext);
-        pipeline.set(`hash:index:sha512:${doc.sha512}`, doc.plaintext);
-        pipeline.hincrby('hash:stats', 'count', 1);
-        pipeline.hincrby('hash:stats', 'size', JSON.stringify(doc).length);
-      }
-      const results = await pipeline.exec();
-      const errorCount = results?.filter(([err]) => err !== null).length || 0;
-      if (errorCount > 0) {
-        state.errors += errorCount;
-        sessionErrors += errorCount;
-        const successCount = toIndex.length - errorCount;
-        state.indexed += successCount;
-        sessionIndexed += successCount;
-      } else {
-        state.indexed += toIndex.length;
-        sessionIndexed += toIndex.length;
-      }
-    }
-    state.lastUpdate = new Date().toISOString();
-    const progress = ((state.lastProcessedLine / state.totalLines) * 100).toFixed(1);
-    process.stdout.write(
-      `\r⏳ Progress: ${state.lastProcessedLine}/${state.totalLines} (${progress}%) - ` +
-      `Indexed: ${sessionIndexed}, Skipped: ${sessionSkipped}, Errors: ${sessionErrors} `
-    );
-    saveState(stateFile, state);
+      // Prepare bulk operations only for items that don't have any duplicate hash
+      for (const item of batchWithHashes) {
+        const isDuplicate =
+          existingHashes.has(item.plaintext) ||
+          existingHashes.has(item.hashes.md5) ||
+          existingHashes.has(item.hashes.sha1) ||
+          existingHashes.has(item.hashes.sha256) ||
+          existingHashes.has(item.hashes.sha512);
+        if (!isDuplicate) {
+          bulkOperations.push({ index: { _index: INDEX_NAME } });
+          bulkOperations.push(item.hashes);
+        } else {
+          state.skipped++;
+          sessionSkipped++;
+        }
+      }
+    } else {
+      // No duplicate checking - index everything
+      for (const item of batchWithHashes) {
+        bulkOperations.push({ index: { _index: INDEX_NAME } });
+        bulkOperations.push(item.hashes);
+      }
+    }
+    // Execute bulk operation only if there are new items to insert
+    if (bulkOperations.length > 0) {
+      try {
+        const bulkResponse = await client.bulk({
+          operations: bulkOperations,
+          refresh: false
+        });
+        if (bulkResponse.errors) {
+          const errorCount = bulkResponse.items.filter((item: any) => item.index?.error).length;
+          state.errors += errorCount;
+          sessionErrors += errorCount;
+          const successCount = (bulkOperations.length / 2) - errorCount;
+          state.indexed += successCount;
+          sessionIndexed += successCount;
+        } else {
+          const count = bulkOperations.length / 2;
+          state.indexed += count;
+          sessionIndexed += count;
+        }
+      } catch (error) {
+        console.error(`\n❌ Error processing batch:`, error);
+        const count = bulkOperations.length / 2;
+        state.errors += count;
+        sessionErrors += count;
+      }
+    }
+    // Update state
+    state.lastProcessedLine = lineNumber;
+    state.totalLines = lineNumber;
+    // Save state periodically (every 10 batches)
+    if (lineNumber % (batchSize * 10) === 0) {
+      saveState(stateFile, state);
+    }
+    // Progress indicator
+    const elapsed = ((Date.now() - sessionStartTime) / 1000).toFixed(0);
+    process.stdout.write(`\r⏳ Line: ${lineNumber} | Session: +${sessionIndexed} indexed, +${sessionSkipped} skipped | Total: ${state.indexed} indexed | Time: ${elapsed}s`);
   };
   for await (const line of rl) {
+    if (isInterrupted) break;
     currentLineNumber++;
+    // Skip already processed lines
     if (currentLineNumber <= resumingFrom) {
       continue;
     }
-    if (isInterrupted) break;
-    const trimmed = line.trim();
-    if (!trimmed) continue;
-    currentBatch.push(trimmed);
-    state.lastProcessedLine = currentLineNumber;
-    if (currentBatch.length >= batchSize) {
-      await processBatch(currentBatch);
-      currentBatch = [];
-    }
+    const trimmedLine = line.trim();
+    if (trimmedLine.length > 0) {
+      // Only take first word (no spaces or separators)
+      const firstWord = trimmedLine.split(/\s+/)[0];
+      if (firstWord) {
+        currentBatch.push(firstWord);
+        if (currentBatch.length >= batchSize) {
+          await processBatch(currentBatch, currentLineNumber);
+          currentBatch = [];
+        }
+      }
+    }
   }
+  // Process remaining items in last batch
   if (currentBatch.length > 0 && !isInterrupted) {
-    await processBatch(currentBatch);
+    await processBatch(currentBatch, currentLineNumber);
   }
-  console.log('\n');
-  if (!isInterrupted) {
-    const totalTime = ((Date.now() - sessionStartTime) / 1000).toFixed(2);
-    const rate = (sessionIndexed / parseFloat(totalTime)).toFixed(2);
-    console.log('━'.repeat(42));
-    console.log('✅ Indexing complete!');
-    console.log('');
-    console.log('📊 Session Statistics:');
-    console.log(`   Indexed: ${sessionIndexed}`);
-    console.log(`   Skipped: ${sessionSkipped}`);
-    console.log(`   Errors: ${sessionErrors}`);
-    console.log(`   Time: ${totalTime}s`);
-    console.log(`   Rate: ${rate} docs/sec`);
-    console.log('');
-    console.log('📈 Total Statistics:');
-    console.log(`   Total indexed: ${state.indexed}`);
-    console.log(`   Total skipped: ${state.skipped}`);
-    console.log(`   Total errors: ${state.errors}`);
-    console.log('');
-    deleteState(stateFile);
-  }
-  await client.quit();
+  if (isInterrupted) {
+    return;
+  }
+  // Refresh index
+  console.log('\n\n🔄 Refreshing index...');
+  await client.indices.refresh({ index: INDEX_NAME });
+  // Delete state file on successful completion
+  deleteState(stateFile);
+  const duration = ((Date.now() - sessionStartTime) / 1000).toFixed(2);
+  const rate = sessionIndexed > 0 ? (sessionIndexed / parseFloat(duration)).toFixed(0) : '0';
+  console.log('\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━');
+  console.log('✅ Indexing complete!');
+  console.log(`━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━`);
+  console.log(`Total lines processed: ${currentLineNumber}`);
+  if (resumingFrom > 0) {
+    console.log(`Lines skipped (resumed): ${resumingFrom}`);
+    console.log(`Lines processed this session: ${currentLineNumber - resumingFrom}`);
+  }
+  console.log(`Successfully indexed (total): ${state.indexed}`);
+  console.log(`Successfully indexed (session): ${sessionIndexed}`);
+  console.log(`Skipped duplicates (total): ${state.skipped}`);
+  console.log(`Skipped duplicates (session): ${sessionSkipped}`);
+  console.log(`Errors (total): ${state.errors}`);
+  console.log(`Session duration: ${duration}s`);
+  console.log(`Session rate: ${rate} docs/sec`);
+  console.log('');
 } catch (error) {
-  console.error('\n\n❌ Error:', error);
+  // Save state on error
   saveState(stateFile, state);
-  console.log(`💾 State saved to ${stateFile}`);
-  await client.quit();
+  console.error(`\n💾 State saved to ${stateFile}`);
+  console.error('❌ Error:', error instanceof Error ? error.message : error);
   process.exit(1);
+ } finally {
+  // Remove signal handlers
+  process.removeListener('SIGINT', handleInterrupt);
+  process.removeListener('SIGTERM', handleInterrupt);
 }
 }
-main();
+// Parse command line arguments
const args = process.argv.slice(2);
const parsedArgs = parseArgs(args);
if (parsedArgs.showHelp || !parsedArgs.filePath) {
showHelp();
}
const filePath = parsedArgs.filePath as string;
// Validate file exists
if (!existsSync(filePath)) {
console.error(`❌ File not found: ${filePath}`);
process.exit(1);
}
console.log(`\n🔧 Configuration:`);
console.log(` File: ${filePath}`);
console.log(` Batch size: ${parsedArgs.batchSize}`);
console.log(` Resume: ${parsedArgs.resume}`);
console.log(` Check duplicates: ${parsedArgs.checkDuplicates}`);
if (parsedArgs.stateFile) {
console.log(` State file: ${parsedArgs.stateFile}`);
}
console.log('');
indexFile(filePath, parsedArgs.batchSize, parsedArgs.resume, parsedArgs.checkDuplicates, parsedArgs.stateFile).catch(console.error);


@@ -3,7 +3,7 @@
 /**
  * Hasher Duplicate Remover Script
  *
- * This script finds and removes duplicate entries from Redis.
+ * This script finds and removes duplicate entries from the Elasticsearch index.
  * It identifies duplicates by checking plaintext, md5, sha1, sha256, and sha512 fields.
  *
  * Usage:
@@ -13,28 +13,17 @@
 * Options:
 * --dry-run Show duplicates without removing them (default)
 * --execute Actually remove the duplicates
-* --batch-size=<number> Number of keys to scan in each batch (default: 1000)
-* --field=<field> Check duplicates only on this field (md5, sha1, sha256, sha512)
+* --batch-size=<number> Number of items to process in each batch (default: 1000)
+* --field=<field> Check duplicates only on this field (plaintext, md5, sha1, sha256, sha512)
 * --help, -h Show this help message
 */
-import Redis from 'ioredis';
-const REDIS_HOST = process.env.REDIS_HOST || 'localhost';
-const REDIS_PORT = parseInt(process.env.REDIS_PORT || '6379', 10);
-const REDIS_PASSWORD = process.env.REDIS_PASSWORD || undefined;
-const REDIS_DB = parseInt(process.env.REDIS_DB || '0', 10);
+import { Client } from '@elastic/elasticsearch';
+const ELASTICSEARCH_NODE = process.env.ELASTICSEARCH_NODE || 'http://localhost:9200';
+const INDEX_NAME = 'hasher';
 const DEFAULT_BATCH_SIZE = 1000;
-interface HashDocument {
-  plaintext: string;
-  md5: string;
-  sha1: string;
-  sha256: string;
-  sha512: string;
-  created_at: string;
-}
 interface ParsedArgs {
   dryRun: boolean;
   batchSize: number;
@@ -45,9 +34,9 @@ interface ParsedArgs {
 interface DuplicateGroup {
   value: string;
   field: string;
-  plaintexts: string[];
-  keepPlaintext: string;
-  deletePlaintexts: string[];
+  documentIds: string[];
+  keepId: string;
+  deleteIds: string[];
 }
 function parseArgs(args: string[]): ParsedArgs {
@@ -107,244 +96,401 @@ Usage:
 Options:
   --dry-run Show duplicates without removing them (default)
   --execute Actually remove the duplicates
-  --batch-size=<number> Number of keys to scan in each batch (default: 1000)
+  --batch-size=<number> Number of items to process in each batch (default: 1000)
   --field=<field> Check duplicates only on this field
-    Valid fields: md5, sha1, sha256, sha512
+    Valid fields: plaintext, md5, sha1, sha256, sha512
   --help, -h Show this help message

 Environment Variables:
-  REDIS_HOST Redis host (default: localhost)
-  REDIS_PORT Redis port (default: 6379)
-  REDIS_PASSWORD Redis password (optional)
-  REDIS_DB Redis database number (default: 0)
+  ELASTICSEARCH_NODE Elasticsearch node URL (default: http://localhost:9200)

 Examples:
-  # Dry run (show duplicates only)
-  npm run remove-duplicates
-  # Actually remove duplicates
-  npm run remove-duplicates -- --execute
-  # Check only MD5 duplicates
-  npm run remove-duplicates -- --field=md5 --execute
+  npx tsx scripts/remove-duplicates.ts                              # Dry run, show all duplicates
+  npx tsx scripts/remove-duplicates.ts --execute                    # Remove all duplicates
+  npx tsx scripts/remove-duplicates.ts --field=md5                  # Check only md5 duplicates
+  npx tsx scripts/remove-duplicates.ts --execute --field=plaintext

-Description:
-  This script scans through all hash documents in Redis and identifies
-  duplicates based on hash values. When duplicates are found, it keeps
-  the oldest entry (by created_at) and marks the rest for deletion.
+Notes:
+  - The script keeps the OLDEST document (by created_at) and removes newer duplicates
+  - Always run with --dry-run first to review what will be deleted
+  - Duplicates are checked across all hash fields by default
 `);
-  process.exit(0);
 }
 async function findDuplicatesForField(
-  client: Redis,
-  field: 'md5' | 'sha1' | 'sha256' | 'sha512',
+  client: Client,
+  field: string,
   batchSize: number
 ): Promise<DuplicateGroup[]> {
-  const pattern = `hash:index:${field}:*`;
-  const hashToPlaintexts: Map<string, string[]> = new Map();
-  console.log(`🔍 Scanning ${field} indexes...`);
-  let cursor = '0';
-  let keysScanned = 0;
-  do {
-    const [nextCursor, keys] = await client.scan(cursor, 'MATCH', pattern, 'COUNT', batchSize);
-    cursor = nextCursor;
-    keysScanned += keys.length;
-    for (const key of keys) {
-      const hash = key.replace(`hash:index:${field}:`, '');
-      const plaintext = await client.get(key);
-      if (plaintext) {
-        if (!hashToPlaintexts.has(hash)) {
-          hashToPlaintexts.set(hash, []);
-        }
-        hashToPlaintexts.get(hash)!.push(plaintext);
-      }
-    }
-    process.stdout.write(`\r  Keys scanned: ${keysScanned}  `);
-  } while (cursor !== '0');
-  console.log('');
-  const duplicates: DuplicateGroup[] = [];
-  for (const [hash, plaintexts] of hashToPlaintexts.entries()) {
-    if (plaintexts.length > 1) {
-      // Fetch documents to get created_at timestamps
-      const docs = await Promise.all(
-        plaintexts.map(async (pt) => {
-          const data = await client.get(`hash:plaintext:${pt}`);
-          return data ? JSON.parse(data) as HashDocument : null;
-        })
-      );
-      const validDocs = docs.filter((doc): doc is HashDocument => doc !== null);
-      if (validDocs.length > 1) {
-        // Sort by created_at, keep oldest
-        validDocs.sort((a, b) => a.created_at.localeCompare(b.created_at));
-        duplicates.push({
-          value: hash,
-          field,
-          plaintexts: validDocs.map(d => d.plaintext),
-          keepPlaintext: validDocs[0].plaintext,
-          deletePlaintexts: validDocs.slice(1).map(d => d.plaintext)
-        });
-      }
-    }
-  }
+  const duplicates: DuplicateGroup[] = [];
+  // Use aggregation to find duplicate values
+  const fieldToAggregate = field === 'plaintext' ? 'plaintext.keyword' : field;
+  // Use composite aggregation to handle large number of duplicates
+  let afterKey: any = undefined;
+  let hasMore = true;
+  console.log(`  Scanning for duplicates...`);
+  while (hasMore) {
+    const aggQuery: any = {
+      index: INDEX_NAME,
+      size: 0,
+      aggs: {
+        duplicates: {
+          composite: {
+            size: batchSize,
+            sources: [
+              { value: { terms: { field: fieldToAggregate } } }
+            ],
+            ...(afterKey && { after: afterKey })
+          },
+          aggs: {
+            doc_count_filter: {
+              bucket_selector: {
+                buckets_path: { count: '_count' },
+                script: 'params.count > 1'
+              }
+            }
+          }
+        }
+      }
+    };
+    const response = await client.search(aggQuery);
+    const compositeAgg = response.aggregations?.duplicates as any;
+    const buckets = compositeAgg?.buckets || [];
+    for (const bucket of buckets) {
+      if (bucket.doc_count > 1) {
+        const value = bucket.key.value;
+        // Use scroll API for large result sets
+        const documentIds: string[] = [];
+        let scrollResponse = await client.search({
+          index: INDEX_NAME,
+          scroll: '1m',
+          size: 1000,
+          query: {
+            term: {
+              [fieldToAggregate]: value
+            }
+          },
+          sort: [
+            { created_at: { order: 'asc' } }
+          ],
+          _source: false
+        });
+        while (scrollResponse.hits.hits.length > 0) {
+          documentIds.push(...scrollResponse.hits.hits.map((hit: any) => hit._id));
+          if (!scrollResponse._scroll_id) break;
+          scrollResponse = await client.scroll({
+            scroll_id: scrollResponse._scroll_id,
+            scroll: '1m'
+          });
+        }
+        // Clear scroll
+        if (scrollResponse._scroll_id) {
+          await client.clearScroll({ scroll_id: scrollResponse._scroll_id }).catch(() => {});
+        }
+        if (documentIds.length > 1) {
+          duplicates.push({
+            value: String(value),
+            field,
+            documentIds,
+            keepId: documentIds[0], // Keep the oldest
+            deleteIds: documentIds.slice(1) // Delete the rest
+          });
+        }
+      }
+    }
+    // Check if there are more results
+    afterKey = compositeAgg?.after_key;
+    hasMore = buckets.length === batchSize && afterKey;
+    if (hasMore) {
+      process.stdout.write(`\r  Found ${duplicates.length} duplicate groups so far...`);
+    }
+  }
   return duplicates;
 }
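The heart of the new findDuplicatesForField is the composite-aggregation paging loop: request a page of buckets, collect those with `doc_count > 1`, and continue while the page was full and Elasticsearch returned an `after_key`. This sketch replays that control flow against a stubbed search function with two pages of fake buckets (no live cluster; `fakeSearch`, the page shapes, and the bucket values are all invented for the test):

```typescript
interface Bucket { key: { value: string }; doc_count: number; }
interface Page { buckets: Bucket[]; after_key?: { value: string }; }

// Two canned pages standing in for Elasticsearch composite-aggregation responses.
const pages: Page[] = [
  { buckets: [{ key: { value: 'a' }, doc_count: 3 }, { key: { value: 'b' }, doc_count: 2 }], after_key: { value: 'b' } },
  { buckets: [{ key: { value: 'c' }, doc_count: 5 }] } // final page: no after_key
];

async function fakeSearch(after?: { value: string }): Promise<Page> {
  return after ? pages[1] : pages[0];
}

// Same loop shape as the real function: page with after_key until exhausted.
async function collectDuplicateValues(pageSize: number): Promise<string[]> {
  const values: string[] = [];
  let afterKey: { value: string } | undefined = undefined;
  let hasMore = true;
  while (hasMore) {
    const page = await fakeSearch(afterKey);
    for (const b of page.buckets) {
      if (b.doc_count > 1) values.push(b.key.value);
    }
    afterKey = page.after_key;
    // Continue only while the page was full AND an after_key was returned.
    hasMore = page.buckets.length === pageSize && afterKey !== undefined;
  }
  return values;
}

const dupValues = collectDuplicateValues(2);
```

Composite aggregations page deterministically via `after`, which is why the real code can survive an arbitrary number of duplicate groups without the 10k-bucket limits of a plain `terms` aggregation.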
-async function removeDuplicates(
-  client: Redis,
-  duplicates: DuplicateGroup[],
-  dryRun: boolean
-): Promise<{ deleted: number; errors: number }> {
-  let deleted = 0;
-  let errors = 0;
-  console.log('');
-  console.log(`${dryRun ? '🔍 DRY RUN - Would delete:' : '🗑️ Deleting duplicates...'}`);
-  console.log('');
-  for (const dup of duplicates) {
-    console.log(`Duplicate ${dup.field}: ${dup.value}`);
-    console.log(`  Keep: ${dup.keepPlaintext} (oldest)`);
-    console.log(`  Delete: ${dup.deletePlaintexts.join(', ')}`);
-    if (!dryRun) {
-      for (const plaintext of dup.deletePlaintexts) {
-        try {
-          const docKey = `hash:plaintext:${plaintext}`;
-          const docData = await client.get(docKey);
-          if (docData) {
-            const doc: HashDocument = JSON.parse(docData);
-            const pipeline = client.pipeline();
-            // Delete the main document
-            pipeline.del(docKey);
-            // Delete all indexes
-            pipeline.del(`hash:index:md5:${doc.md5}`);
-            pipeline.del(`hash:index:sha1:${doc.sha1}`);
-            pipeline.del(`hash:index:sha256:${doc.sha256}`);
-            pipeline.del(`hash:index:sha512:${doc.sha512}`);
-            // Update statistics
-            pipeline.hincrby('hash:stats', 'count', -1);
-            pipeline.hincrby('hash:stats', 'size', -JSON.stringify(doc).length);
-            const results = await pipeline.exec();
-            if (results && results.some(([err]) => err !== null)) {
-              errors++;
-            } else {
-              deleted++;
-            }
-          }
-        } catch (error) {
-          console.error(`  Error deleting ${plaintext}:`, error);
-          errors++;
-        }
-      }
-    }
-    console.log('');
-  }
-  return { deleted, errors };
-}
+/**
+ * Phase 1: Initialize and connect to Elasticsearch
+ */
+async function phase1_InitAndConnect() {
+  console.log(`🔍 Hasher Duplicate Remover - Phase 1: Initialization`);
+  console.log(`━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━`);
+  console.log(`Elasticsearch: ${ELASTICSEARCH_NODE}`);
+  console.log(`Index: ${INDEX_NAME}`);
+  console.log('');
+  const client = new Client({ node: ELASTICSEARCH_NODE });
+  console.log('🔗 Connecting to Elasticsearch...');
+  await client.cluster.health({});
+  console.log('✅ Connected successfully\n');
+  const countResponse = await client.count({ index: INDEX_NAME });
+  console.log(`📊 Total documents in index: ${countResponse.count}\n`);
+  return { client, totalDocuments: countResponse.count };
+}
+/**
+ * Phase 2: Find duplicates for a specific field
+ */
+async function phase2_FindDuplicatesForField(
+  client: Client,
+  field: string,
+  batchSize: number,
+  seenDeleteIds: Set<string>
+): Promise<{ duplicates: DuplicateGroup[], totalFound: number }> {
+  console.log(`\n🔍 Phase 2: Checking duplicates for field: ${field}...`);
+  const fieldDuplicates = await findDuplicatesForField(client, field, batchSize);
+  const duplicates: DuplicateGroup[] = [];
+  // Filter out already seen delete IDs to avoid counting the same document multiple times
+  for (const dup of fieldDuplicates) {
+    const newDeleteIds = dup.deleteIds.filter(id => !seenDeleteIds.has(id));
+    if (newDeleteIds.length > 0) {
+      dup.deleteIds = newDeleteIds;
+      newDeleteIds.forEach(id => seenDeleteIds.add(id));
+      duplicates.push(dup);
+    }
+  }
+  console.log(`  Found ${fieldDuplicates.length} duplicate groups for ${field}`);
+  console.log(`  New unique documents to delete: ${duplicates.reduce((sum, dup) => sum + dup.deleteIds.length, 0)}`);
+  // Force garbage collection if available
+  if (global.gc) {
+    global.gc();
+    console.log(`  ♻️ Memory freed after processing ${field}`);
+  }
+  return { duplicates, totalFound: fieldDuplicates.length };
+}
+/**
+ * Phase 3: Process deletion for a batch of duplicates
+ */
+async function phase3_DeleteBatch(
+  client: Client,
+  deleteIds: string[],
+  batchSize: number,
+  startIndex: number
+): Promise<{ deleted: number, errors: number }> {
+  const batch = deleteIds.slice(startIndex, startIndex + batchSize);
+  let deleted = 0;
+  let errors = 0;
+  try {
+    const bulkOperations = batch.flatMap(id => [
+      { delete: { _index: INDEX_NAME, _id: id } }
+    ]);
+    const bulkResponse = await client.bulk({
+      operations: bulkOperations,
+      refresh: false
+    });
+    if (bulkResponse.errors) {
+      const errorCount = bulkResponse.items.filter((item: any) => item.delete?.error).length;
+      errors += errorCount;
+      deleted += batch.length - errorCount;
+    } else {
+      deleted += batch.length;
+    }
+  } catch (error) {
+    console.error(`\n❌ Error deleting batch:`, error);
+    errors += batch.length;
+  }
+  // Force garbage collection if available
+  if (global.gc) {
+    global.gc();
+  }
+  return { deleted, errors };
+}
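A detail worth noting in phase3_DeleteBatch: bulk deletes, unlike bulk index operations, consist of a single action line per document with no source body, so the operations array has exactly one entry per id (whereas the indexer pushes two entries per document and divides the array length by two). A minimal sketch of building that payload (the `INDEX` constant and ids are illustrative):

```typescript
const INDEX = 'hasher'; // illustrative index name

// One { delete: ... } action per id; no body line follows a delete action.
function buildBulkDelete(ids: string[]): Array<{ delete: { _index: string; _id: string } }> {
  return ids.flatMap(id => [{ delete: { _index: INDEX, _id: id } }]);
}

const ops = buildBulkDelete(['doc1', 'doc2', 'doc3']);
```

This is why the delete path can use `batch.length` directly for its counters, while the index path has to use `bulkOperations.length / 2`.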
-async function main() {
-  const args = process.argv.slice(2);
-  const parsed = parseArgs(args);
-  if (parsed.showHelp) {
-    showHelp();
-    process.exit(0);
-  }
-  const validFields: Array<'md5' | 'sha1' | 'sha256' | 'sha512'> = ['md5', 'sha1', 'sha256', 'sha512'];
-  const fieldsToCheck = parsed.field
-    ? [parsed.field as 'md5' | 'sha1' | 'sha256' | 'sha512']
-    : validFields;
-  // Validate field
-  if (parsed.field && !validFields.includes(parsed.field as any)) {
-    console.error(`❌ Invalid field: ${parsed.field}`);
-    console.error(`   Valid fields: ${validFields.join(', ')}`);
-    process.exit(1);
-  }
-  const client = new Redis({
-    host: REDIS_HOST,
-    port: REDIS_PORT,
-    password: REDIS_PASSWORD,
-    db: REDIS_DB,
-  });
-  console.log('');
-  console.log('🔍 Hasher Duplicate Remover');
-  console.log('━'.repeat(42));
-  console.log(`Redis: ${REDIS_HOST}:${REDIS_PORT}`);
-  console.log(`Mode: ${parsed.dryRun ? 'DRY RUN' : 'EXECUTE'}`);
-  console.log(`Batch size: ${parsed.batchSize}`);
-  console.log(`Fields to check: ${fieldsToCheck.join(', ')}`);
-  console.log('');
-  try {
-    console.log('🔗 Connecting to Redis...');
-    await client.ping();
-    console.log('✅ Connected successfully\n');
-    const allDuplicates: DuplicateGroup[] = [];
+/**
+ * Phase 4: Finalize and report results
+ */
+async function phase4_Finalize(
+  client: Client,
+  totalDeleted: number,
+  totalErrors: number,
+  initialDocumentCount: number
+) {
+  console.log('\n\n🔄 Phase 4: Refreshing index...');
+  await client.indices.refresh({ index: INDEX_NAME });
+  const newCountResponse = await client.count({ index: INDEX_NAME });
+  console.log('\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━');
+  console.log('✅ Duplicate removal complete!');
+  console.log(`━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━`);
+  console.log(`Documents deleted: ${totalDeleted}`);
+  console.log(`Errors: ${totalErrors}`);
+  console.log(`Previous document count: ${initialDocumentCount}`);
+  console.log(`New document count: ${newCountResponse.count}`);
+  console.log('');
+}
+async function removeDuplicates(parsedArgs: ParsedArgs) {
+  const fields = parsedArgs.field
+    ? [parsedArgs.field]
+    : ['plaintext', 'md5', 'sha1', 'sha256', 'sha512'];
+  console.log(`Mode: ${parsedArgs.dryRun ? '🔎 DRY RUN (no changes)' : '⚠️ EXECUTE (will delete)'}`);
+  console.log(`Batch size: ${parsedArgs.batchSize}`);
+  console.log(`Fields to check: ${fields.join(', ')}`);
+  console.log('');
+  try {
+    // === PHASE 1: Initialize ===
+    const { client, totalDocuments } = await phase1_InitAndConnect();
+    // Force garbage collection after phase 1
+    if (global.gc) {
+      global.gc();
for (const field of fieldsToCheck) { console.log('♻️ Memory freed after initialization\n');
const duplicates = await findDuplicatesForField(client, field, parsed.batchSize);
allDuplicates.push(...duplicates);
console.log(` Found ${duplicates.length} duplicate groups for ${field}`);
} }
console.log(''); // === PHASE 2: Find duplicates field by field ===
console.log(`📊 Total duplicate groups found: ${allDuplicates.length}`); const allDuplicates: DuplicateGroup[] = [];
const seenDeleteIds = new Set<string>();
for (const field of fields) {
const { duplicates } = await phase2_FindDuplicatesForField(
client,
field,
parsedArgs.batchSize,
seenDeleteIds
);
allDuplicates.push(...duplicates);
// Clear field duplicates to free memory
duplicates.length = 0;
}
const totalToDelete = allDuplicates.reduce((sum, dup) => sum + dup.deleteIds.length, 0);
console.log(`\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━`);
console.log(`📋 Summary:`);
console.log(` Duplicate groups found: ${allDuplicates.length}`);
console.log(` Documents to delete: ${totalToDelete}`);
console.log(`━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n`);
if (allDuplicates.length === 0) { if (allDuplicates.length === 0) {
console.log(' No duplicates found!'); console.log(' No duplicates found! Index is clean.\n');
} else { return;
const totalToDelete = allDuplicates.reduce(
(sum, dup) => sum + dup.deletePlaintexts.length,
0
);
console.log(` Total documents to delete: ${totalToDelete}`);
const { deleted, errors } = await removeDuplicates(client, allDuplicates, parsed.dryRun);
if (!parsed.dryRun) {
console.log('━'.repeat(42));
console.log('✅ Removal complete!');
console.log('');
console.log('📊 Statistics:');
console.log(` Deleted: ${deleted}`);
console.log(` Errors: ${errors}`);
} else {
console.log('━'.repeat(42));
console.log('💡 This was a dry run. Use --execute to actually remove duplicates.');
}
} }
await client.quit(); // Show sample of duplicates
console.log(`📝 Sample duplicates (showing first 10):\n`);
const samplesToShow = allDuplicates.slice(0, 10);
for (const dup of samplesToShow) {
const truncatedValue = dup.value.length > 50
? dup.value.substring(0, 50) + '...'
: dup.value;
console.log(` Field: ${dup.field}`);
console.log(` Value: ${truncatedValue}`);
console.log(` Keep: ${dup.keepId}`);
console.log(` Delete: ${dup.deleteIds.length} document(s)`);
console.log('');
}
if (allDuplicates.length > 10) {
console.log(` ... and ${allDuplicates.length - 10} more duplicate groups\n`);
}
if (parsedArgs.dryRun) {
console.log(`━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━`);
console.log(`🔎 DRY RUN - No changes made`);
console.log(` Run with --execute to remove ${totalToDelete} duplicate documents`);
console.log(`━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n`);
return;
}
// === PHASE 3: Execute deletion in batches ===
console.log(`\n🗑 Phase 3: Removing ${totalToDelete} duplicate documents...\n`);
let totalDeleted = 0;
let totalErrors = 0;
const deleteIds = allDuplicates.flatMap(dup => dup.deleteIds);
// Clear allDuplicates to free memory
allDuplicates.length = 0;
// Delete in batches with memory management
for (let i = 0; i < deleteIds.length; i += parsedArgs.batchSize) {
const { deleted, errors } = await phase3_DeleteBatch(
client,
deleteIds,
parsedArgs.batchSize,
i
);
totalDeleted += deleted;
totalErrors += errors;
process.stdout.write(
`\r⏳ Progress: ${Math.min(i + parsedArgs.batchSize, deleteIds.length)}/${deleteIds.length} - ` +
`Deleted: ${totalDeleted}, Errors: ${totalErrors}`
);
}
// Clear deleteIds to free memory
deleteIds.length = 0;
seenDeleteIds.clear();
// === PHASE 4: Finalize ===
await phase4_Finalize(client, totalDeleted, totalErrors, totalDocuments);
} catch (error) { } catch (error) {
console.error('\n\n❌ Error:', error); console.error('\n❌ Error:', error instanceof Error ? error.message : error);
await client.quit();
process.exit(1); process.exit(1);
} }
} }
main(); // Parse command line arguments
const args = process.argv.slice(2);
const parsedArgs = parseArgs(args);
if (parsedArgs.showHelp) {
showHelp();
}
// Validate field if provided
const validFields = ['plaintext', 'md5', 'sha1', 'sha256', 'sha512'];
if (parsedArgs.field && !validFields.includes(parsedArgs.field)) {
console.error(`❌ Invalid field: ${parsedArgs.field}`);
console.error(` Valid fields: ${validFields.join(', ')}`);
process.exit(1);
}
console.log(`\n🔧 Configuration:`);
console.log(` Mode: ${parsedArgs.dryRun ? 'dry-run' : 'execute'}`);
console.log(` Batch size: ${parsedArgs.batchSize}`);
if (parsedArgs.field) {
console.log(` Field: ${parsedArgs.field}`);
} else {
console.log(` Fields: all (plaintext, md5, sha1, sha256, sha512)`);
}
console.log('');
removeDuplicates(parsedArgs).catch(console.error);
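The core dedup idea in this change — group documents by a field's value, keep the first id as canonical, and collect the rest for deletion, with a shared `seenDeleteIds` set so a document flagged under one field is never scheduled again under another — can be sketched in isolation. The `Doc` shape and sample data below are made up for illustration; only the `DuplicateGroup` fields mirror the script:

```typescript
interface Doc {
  id: string;
  fields: Record<string, string>; // e.g. { md5: '...', sha1: '...' }
}

interface DuplicateGroup {
  field: string;
  value: string;
  keepId: string;      // first id seen for this value: the document we keep
  deleteIds: string[]; // later ids with the same value, scheduled for deletion
}

// Group docs by one field's value; skip ids already scheduled for another field.
function findDuplicatesForField(
  docs: Doc[],
  field: string,
  seenDeleteIds: Set<string>
): DuplicateGroup[] {
  const byValue = new Map<string, string[]>();
  for (const doc of docs) {
    const value = doc.fields[field];
    if (value === undefined) continue;
    const ids = byValue.get(value) ?? [];
    ids.push(doc.id);
    byValue.set(value, ids);
  }
  const groups: DuplicateGroup[] = [];
  for (const [value, ids] of byValue) {
    // Keep ids[0]; everything after it is a duplicate, unless already scheduled.
    const deleteIds = ids.slice(1).filter(id => !seenDeleteIds.has(id));
    if (deleteIds.length === 0) continue;
    deleteIds.forEach(id => seenDeleteIds.add(id));
    groups.push({ field, value, keepId: ids[0], deleteIds });
  }
  return groups;
}

// Demo: 'b' duplicates 'a' on md5; on sha1 all three collide, but 'b' is
// already scheduled from the md5 pass, so only 'c' is added there.
const docs: Doc[] = [
  { id: 'a', fields: { md5: 'X', sha1: 'Y' } },
  { id: 'b', fields: { md5: 'X', sha1: 'Y' } },
  { id: 'c', fields: { md5: 'Z', sha1: 'Y' } },
];
const seen = new Set<string>();
const md5Groups = findDuplicatesForField(docs, 'md5', seen);
const sha1Groups = findDuplicatesForField(docs, 'sha1', seen);
console.log(JSON.stringify(md5Groups));
console.log(JSON.stringify(sha1Groups));
```

The cross-field set is what makes deletion safe to flatten into one `deleteIds` batch list: without it, a document duplicated on several hash fields would be submitted to the bulk delete more than once.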