Building a Node.js REST API to Scrape Metadata from URLs with Metadata-Scraper

Syed Ziauddin
Oct 12, 2023


Poster image: scraping meta tags with Node.js

In today’s digital world, information is more accessible than ever before, and web scraping has become a valuable tool for extracting data from websites. In this article, we’ll explore how to create a Node.js REST API to scrape metadata from URLs using the metadata-scraper npm package. This versatile API will allow you to fetch metadata from web pages, making it a handy tool for various applications, such as building link previews or content curation services.

Prerequisites

Before we dive into the code, make sure you have Node.js and npm installed on your machine. You can download them from the official website (https://nodejs.org/). Additionally, you should be familiar with JavaScript and have a basic understanding of REST APIs.

TECH STACK

Our tech stack consists of Node.js, Express, and the metadata-scraper npm package, with MongoDB (via Mongoose) for storing the scraped results. This combination lets us build a simple, efficient web scraping API.

I will be using the VS Code editor. Go ahead and initialize your Node.js project by running this command:

npm init -y

Next, install the dependencies the project needs. Note that the code below uses metadata-scraper and cors, so both are included here:

npm i express express-async-errors express-validator metadata-scraper mongoose morgan cors

Here is a brief introduction to each of these dependencies:

  1. Express: A framework for building web applications and APIs in Node.js. It simplifies handling web requests and responses.
  2. Express-async-errors: A small package that lets errors thrown in asynchronous (non-blocking) Express route handlers reach your error-handling middleware.
  3. Express-validator: A package for validating and sanitizing user input in an Express application, helping to ensure data is safe and valid.
  4. Metadata-scraper: The package we use to extract metadata from web pages, such as titles, descriptions, and images.
  5. Mongoose: A popular Node.js package for working with MongoDB, a NoSQL database. Mongoose simplifies database operations and provides a structured way to interact with MongoDB.
  6. Morgan: An Express middleware that logs HTTP requests and responses, which is useful for debugging and monitoring your application's activity.
  7. Cors: An Express middleware that enables Cross-Origin Resource Sharing, so the API can be called from a frontend running on a different origin.

Setting Up the Project

Here’s the structure of our project:

- app.js
- routes/metaTag.js
- middleware/errorhandler.js
- db/index.js
- model/scarpData.js
- controller/scrapMetaData.js

1. app.js

The app.js file serves as the entry point for our application. It sets up the Express server, includes necessary middleware, and defines a basic root route for API documentation.

// Import necessary packages and modules
const express = require('express'); // Import the Express framework
const morgan = require('morgan'); // Import the Morgan middleware for logging
const { errorHandler } = require('./middleware/errorhandler'); // Import a custom error handling middleware
const cors = require('cors'); // Import the CORS middleware for handling Cross-Origin Resource Sharing
require('express-async-errors'); // Import a package to simplify async error handling
require('./db'); // Import database setup code (presumably for MongoDB)
const app = express(); // Create an Express application

// Set up middleware
app.use(express.json()); // Parse JSON request bodies
app.use(cors()); // Enable CORS to allow cross-origin requests
app.use(morgan('dev')); // Log HTTP requests using Morgan in 'dev' format

const PORT = process.env.PORT || 8000; // Use the port from the environment if provided, otherwise default to 8000

// Define a root route for documentation
app.get('/', (req, res) => {
  res.send(`
    <h1>Meta Web Scraping API</h1>
    <p>Use /api/scrap</p>
    <p>Author: ZiaCodes</p>
  `);
});

// Additional setup for routing
const megaScrapper = require('./routes/metaTag'); // Import routing for the 'metaTag' path

// Define a route with the 'api' prefix
app.use('/api', megaScrapper);

// Implement async error handling using the custom 'errorHandler' middleware
app.use(errorHandler);

// Start the server and listen on the defined port
app.listen(PORT, () => {
  console.log(`🚀 Server is running at http://localhost:${PORT}`);
});
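With your MongoDB connection string filled in inside db/index.js (covered below), you can start the server from the project root and open http://localhost:8000 in the browser to see the documentation route:

node app.js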

2. routes/metaTag.js

The routes/metaTag.js file is where we define the routing logic for our API, specifying endpoints for scraping metadata, deleting metadata, and retrieving metadata.

// Import necessary packages and modules
const express = require('express'); // Import the Express framework
const { getAllMetaData, scrapMetaData, deleteURLMetaData } = require('../controller/scrapMetaData'); // Import controller functions for handling routes
const router = express.Router(); // Create an Express router

// Define routes and associate them with controller functions
router.post('/scrap', scrapMetaData); // Handle POST requests to '/scrap' by calling the 'scrapMetaData' function
router.delete('/deleteMetaTag', deleteURLMetaData); // Handle DELETE requests to '/deleteMetaTag' by calling the 'deleteURLMetaData' function
router.get('/getmetadata', getAllMetaData); // Handle GET requests to '/getmetadata' by calling the 'getAllMetaData' function

module.exports = router; // Export the router to be used in the main application
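Because app.js mounts this router under the /api prefix, the full endpoints are POST /api/scrap, GET /api/getmetadata, and DELETE /api/deleteMetaTag.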

3. middleware/errorhandler.js

The error handling middleware in middleware/errorhandler.js helps ensure that our API handles errors gracefully.

// Define and export an error handling middleware
exports.errorHandler = (err, req, res, next) => {
  console.log("err:", err); // Log the error to the console for debugging purposes

  // Respond with a 500 (Internal Server Error) status and a JSON object
  // containing the error message or the error itself
  res.status(500).json({
    error: err.message || err
  });
};
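To see how this fits together: because express-async-errors is required in app.js, an error thrown inside any async route handler is forwarded to this middleware automatically instead of becoming an unhandled promise rejection. Here is a minimal standalone sketch (a hypothetical demo file, not part of the project) that shows the behavior:

// error-demo.js — illustration only
const express = require('express');
require('express-async-errors'); // patch Express so async errors reach the error middleware
const { errorHandler } = require('./middleware/errorhandler');

const app = express();

app.get('/boom', async (req, res) => {
  throw new Error('Something went wrong'); // forwarded to errorHandler automatically
});

app.use(errorHandler);

app.listen(3000); // GET /boom now returns status 500 with { "error": "Something went wrong" }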

4. db/index.js

The db/index.js file is used to set up the database connection. In this case, it connects to a MongoDB database.

// Import the Mongoose library
const mongoose = require('mongoose');

// Connect to the MongoDB database using the provided connection string
mongoose.connect("mongodb+srv://your-connection-string-here", {
  useNewUrlParser: true, // Use the new URL parser
  useUnifiedTopology: true // Use the new Server Discovery and Monitoring engine
})
  .then(() => {
    console.log("😍 Db Connected Successfully."); // Log a success message if the database connection is established
  })
  .catch((err) => {
    console.log("😓 Db failed to Connect", err); // Log an error message if there's an issue with the database connection
  });
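The controller below imports a ScarpData model from model/scarpData.js. Here is a minimal sketch of that model, assuming a schema with exactly the fields the controller saves (url, title, author, description, and poster); adjust the types and validation to your needs:

// model/scarpData.js — a minimal sketch of the model the controller imports
const mongoose = require('mongoose');

const scarpDataSchema = new mongoose.Schema({
  url: { type: String, required: true }, // the page that was scraped
  title: String,       // page title from the metadata
  author: String,      // author, if the page exposes one
  description: String, // meta description
  poster: String,      // preview image URL (mapped from data.image in the controller)
});

module.exports = mongoose.model('ScarpData', scarpDataSchema);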

5. controller/scrapMetaData.js

The controller/scrapMetaData.js file contains the core functionality for web scraping. This is where we utilize the metadata-scraper npm package to extract metadata from URLs.

// Import necessary packages and modules
const ScarpData = require("../model/scarpData"); // Import the model for storing scraped data
const getMetaData = require('metadata-scraper'); // Package for scraping metadata from URLs

// Function to scrape and save metadata from a URL
exports.scrapMetaData = async (req, res) => {
  const { url } = req.body;

  console.log(url);

  // Check if the URL already exists in the database
  const oldURL = await ScarpData.findOne({ url });
  if (oldURL)
    return res.json({ error: "This URL is already Added!" });

  // Fetch metadata from the provided URL
  const data = await getMetaData(url);

  // If no data is found, return an error
  if (!data) return res.json({ error: "No Data Found" });

  // Create a new document with the scraped data and save it to the database
  const metaTagData = new ScarpData({
    url: data.url,
    title: data.title,
    author: data.author,
    description: data.description,
    poster: data.image,
  });

  await metaTagData.save();
  res.json({ data });
}

// Function to delete metadata associated with a URL
exports.deleteURLMetaData = async (req, res) => {
  const { url } = req.body;
  if (!url) return res.json({ error: "URL input cannot be empty!" });

  // Find the metadata document by URL
  const UrlmetaData = await ScarpData.findOne({ url });

  // If the URL doesn't exist, return an error
  if (!UrlmetaData)
    return res.json({ error: "This URL doesn't exist in the database!" });

  // Delete the metadata document by its ID
  await ScarpData.findByIdAndDelete(UrlmetaData?._id);
  return res.json({ message: "Meta Data deleted Successfully!" });
}

// Function to get all saved metadata
exports.getAllMetaData = async (req, res) => {
  // Retrieve all metadata documents from the database
  const response = await ScarpData.find();
  res.json({ response });
}

In this code, scrapMetaData fetches and saves metadata from a URL, deleteURLMetaData deletes the metadata associated with a URL, and getAllMetaData retrieves all saved metadata documents from the database. These functions interact with the database and perform actions based on incoming HTTP requests.

Now test the API locally with a tool like Postman, or deploy it as a web service on a platform such as Render.
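If you prefer a quick check without Postman, here is a small script that exercises the scrape and list endpoints. This is only a sketch, assuming Node 18+ (where fetch is built in) and the server running locally on port 8000; the file name is hypothetical:

// testScrap.js — hypothetical helper script, not part of the project files
const BASE_URL = 'http://localhost:8000/api';

async function testScrap() {
  // Scrape metadata for a page and store it
  const res = await fetch(`${BASE_URL}/scrap`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: 'https://nodejs.org/' }),
  });
  console.log(await res.json()); // the scraped metadata, or an error message

  // List everything saved so far
  const all = await fetch(`${BASE_URL}/getmetadata`);
  console.log(await all.json());
}

testScrap();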

Remember to use web scraping responsibly and respect the terms of use of the websites you scrape. Additionally, you can enhance this API with features such as caching, rate limiting, and advanced error handling. Happy web scraping!
