Regex to match the title in a citation

1

I'm trying to capture the titles of all citations in scientific articles. My regex looks like this:

  

(A-Za-za-a-a-ee-i-a-i-oo-a.) (0,1) (0.1) (0.1) (0.1) (0.1) (0.1) (0.1) -Za-z0-9: -aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, }.) {0,1}

Some examples of citations, with the title in bold:

DI MAIO, P. The Missing Pragmatic Link in the Semantic Web . Business Intelligence Advisory Service Executive Update. v. 8, n. 7, 2008.

ECO, U. Reader in Fabula: interpretive cooperation in narrative text Barcelona: Lumen, 1987

ECO, U. The concept of text. São Paulo: T. A. Q. / EDUSP, 1984.

ECO, U. Open work: form and indeterminacy in contemporary poetics. São Paulo: Perspectiva, 1988.

ECO, U. The limits of interpretation. São Paulo: Pioneira, 2000.

EDMONDS, B. The Pragmatic Roots of Context . In: PROC. OF THE 2ND INTERNATIONAL AND INTERDISCIPLINARY CONFERENCE ON MODELING AND USING CONTEXT. Berlin; Heidelberg; New York, v. 1688, 1999. Annals ... v. 1688, p. 119-132, 1999.

BERNERS-LEE, T. Semantic Web Concepts. 2005a. Available at: link . Accessed on: 25 Sep. 2014

BERNERS-LEE, T. Web for real people . 2005b. Available in . Accessed on: 25 Sep. 2014.

BERNERS-LEE, T.; CAILLIAU, R. WorldWideWeb: Proposal for a HyperText Project. 1990. Available at: < link >. Accessed on: Oct 13. 2014.

BERNERS-LEE, T.; HENDLER, J.; LASSILA, O. The semantic web: a new form of web content that is meaningful to computers will unleash a revolution of new possibilities New York: Scientific American, 2001. Available in: link . Accessed on: Oct 13. 2014.

BLAIR, D. C. Information Retrieval and the Philosophy of Language Annual Review of Information Science and Technology, v. 37, pp. 3-50, Medford, 2003.

BLAIR, D. C. Wittgenstein, Language and Information: Back to the Rough Ground! Dordrecht: Springer, 2006.

BONFIM, M. E. Recovery of Text Documents Using an Extended Probabilistic Model Piracicaba: UNIMEP, 2006. 131 f. Dissertation (Master in Computer Science). Master in Computer Science. Metodista University of Piracicaba, 2006.

BORLUND, P. The Concept of Relevance in IR. Journal of the American Society for Information Science and Technology, v.54, p. 913-925, 2003.

BORST, W. N. Construction of engineering ontologies. Thesis (Doctorate in Information and Knowledge Systems). University of Tweenty - Center for Telematics and Information Technology, Enschede, Nederland, 1997.

BOUNDLESS. Boundless Psychology. 201X. Available in < link > Accessed on: 13 Aug. 2014.

BRATT, S. Semantic Web, and Other Technologies to Watch. 2008. Available at < link > Accessed on: 13 Aug. 2014.

BRÉAL, M. Semantics: studies in the science of meaning. New York: Henry Holt & Company, 1900.

BRICKLEY, D.; MILLER, L. FOAF Vocabulary Specification 0.9. 2007. Available in < link > Accessed on: 17 May 2015.

BRITISH LIBRARY. Sample Data. Available at. Accessed on: 12 Dec. 2014.

BRUYNE, P. de, HERMAN, J., SCHOUTHEETE, M. de. Dynamics of research in social sciences. Rio de Janeiro: Francisco Alves, 1977.

BUFREM, L. S, et al. Modeling practices for the socialization of information - the construction of knowledge in higher education. Perspectives in Information Science, Belo Horizonte, v.15, n.2, p.22-41, may / ago. 2010.

These are not all the cases; the complete list of citations is found here:

For testing: link

    
asked by anonymous 03.04.2018 / 17:18

2 answers

2

Instead of using a regex, I suggest splitting each citation into an array on ". " (a period followed by a space) and taking the element at index [1], which will be just the title. See:

var strings = [
   "DI MAIO, P. The Missing Pragmatic Link in the Semantic Web. Business Intelligence Advisory Service Executive Update. v. 8, n. 7, 2008.",
   "ECO, U. Lector in Fabula: la cooperación interpretativa en el texto narrativo. Barcelona: Lumen, 1987",
   "ECO, U. O conceito de texto. São Paulo: T. A. Q. /EDUSP, 1984.",

   "ECO, U. Obra aberta: forma e indeterminação nas poéticas contemporâneas. São Paulo: Perspectiva, 1988.",
   "ECO, U. Os limites da interpretação. São Paulo: Pioneira, 2000.",
   "EDMONDS, B. The Pragmatic Roots of Context. In: PROC. OF THE 2ND INTERNATIONAL AND INTERDISCIPLINARY CONFERENCE ON MODELING AND USING CONTEXT. Berlin; Heidelberg; New York, v. 1688, 1999. Anais… v. 1688, p. 119-132, 1999.",
   "BERNERS-LEE, T. Semantic Web Concepts. 2005a. Disponível em: http://www.w3.org/2005/Talks/0517-boit-tbl. Acesso em: 25 set. 2014",
   "BERNERS-LEE, T. Web for real people. 2005b. Disponível em . Acesso em: 25 set. 2014.",
   "BERNERS-LEE, T.; CAILLIAU, R. WorldWideWeb: Proposal for a HyperText Project. 1990. Disponível em: < http://www.w3.org/Proposal.html >. Acesso em: 13 out. 2014.",
   "BERNERS-LEE, T.; HENDLER, J.; LASSILA, O. The semantic web: a new form of web content that is meaningful to computers will unleash a revolution of new possibilities. New York: Scientific American, 2001. Disponível em: http://www.sciam.com/2001/050lissue/0501berners-lee.html. Acesso em: 13 out. 2014."
]

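// Split each citation on ". " and take the second piece as the title.
var res = document.querySelector("#res");
for(var x=0; x<strings.length; x++){
   var titulo = strings[x].split(". ")[1];
   // Highlight the title inside the citation, then print it on its own line.
   res.innerHTML += strings[x].replace(titulo,"<span style='color:blue;'>"+titulo+"</span>")+"<br><b style='color: red;'>Title -></b> <b>"+titulo+"</b><br><br>";
}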
<div id="res"></div>
  

This assumes that the title itself contains no period followed by a space.

The code would be this:

var string = "DI MAIO, P. The Missing Pragmatic Link in the Semantic Web. Business Intelligence Advisory Service Executive Update. v. 8, n. 7, 2008";

var titulo = string.split(". ")[1];
console.log(titulo);
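
To make that assumption concrete, here is a minimal sketch that wraps the same split in a function. The name `titleBySplit` and the `null` return value are my own choices for illustration, not part of the answer. The second call uses one of the citations from the question, where the author has two initials.

```javascript
// Hypothetical helper: same split(". ") idea, wrapped in a function.
function titleBySplit(citation) {
  const parts = citation.split(". ");
  // parts[0] is the author block; parts[1] is the title only if the
  // author block itself contains no ". " sequence.
  return parts.length > 1 ? parts[1] : null;
}

console.log(titleBySplit("ECO, U. Os limites da interpretação. São Paulo: Pioneira, 2000."));
// prints: Os limites da interpretação

console.log(titleBySplit("BLAIR, D. C. Wittgenstein, Language and Information: Back to the Rough Ground! Dordrecht: Springer, 2006."));
// prints: C
```

In the second case the split falls between the two initials, so index [1] holds the second initial rather than the title. The approach is only safe when every author block ends with a single initial.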

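Another way is to scan the string directly: find the first lowercase letter, step back to the start of the title, and read up to the next period followed by a space: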

var strings = [
   "DI MAIO, P. The Missing Pragmatic Link in the Semantic Web. Business Intelligence Advisory Service Executive Update. v. 8, n. 7, 2008.",
   "ECO, U. Lector in Fabula: la cooperación interpretativa en el texto narrativo. Barcelona: Lumen, 1987",
   "ECO, U. O conceito de texto. São Paulo: T. A. Q. /EDUSP, 1984.",
   "ECO, U. Obra aberta: forma e indeterminação nas poéticas contemporâneas. São Paulo: Perspectiva, 1988.",
   "ECO, U. Os limites da interpretação. São Paulo: Pioneira, 2000.",
   "EDMONDS, B. The Pragmatic Roots of Context. In: PROC. OF THE 2ND INTERNATIONAL AND INTERDISCIPLINARY CONFERENCE ON MODELING AND USING CONTEXT. Berlin; Heidelberg; New York, v. 1688, 1999. Anais… v. 1688, p. 119-132, 1999.",
   "BERNERS-LEE, T. Semantic Web Concepts. 2005a. Disponível em: http://www.w3.org/2005/Talks/0517-boit-tbl. Acesso em: 25 set. 2014",
   "BERNERS-LEE, T. Web for real people. 2005b. Disponível em . Acesso em: 25 set. 2014.",
   "BERNERS-LEE, T.; CAILLIAU, R. WorldWideWeb: Proposal for a HyperText Project. 1990. Disponível em: < http://www.w3.org/Proposal.html >. Acesso em: 13 out. 2014.",
   "BERNERS-LEE, T.; HENDLER, J.; LASSILA, O. The semantic web: a new form of web content that is meaningful to computers will unleash a revolution of new possibilities. New York: Scientific American, 2001. Disponível em: http://www.sciam.com/2001/050lissue/0501berners-lee.html. Acesso em: 13 out. 2014."
]

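for(var x=0; x<strings.length; x++){

   // Find the first lowercase letter; the author block is all uppercase.
   for(var y=0; y<strings[x].length; y++){

      var letra = strings[x][y];

      if(letra.match(/[a-z]/)){
         // If the letter starts a word, the title began with a one-letter word
         // two positions back (e.g. "O conceito"); otherwise one position back.
         var titIni = y-(strings[x][y-1] == " " ? 2 : 1);
         break;
      }
   }

   // The title runs up to the next period followed by a space.
   var titulo = strings[x].substring(titIni,strings[x].indexOf(". ", titIni));
   document.querySelector("#res").innerHTML += strings[x].replace(titulo,"<span style='color:blue;'>"+titulo+"</span>")+"<br><b style='color: red;'>Title -></b> <b>"+titulo+"</b><br><br>";

}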
<div id="res"></div>
  

This also assumes that the title contains no period followed by a space.

Code:

var string = "ECO, U. O conceito de texto. São Paulo: T. A. Q. /EDUSP, 1984.";

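// Find the first lowercase letter and step back to the start of the title.
for(var x=0; x<string.length; x++){

   var letra = string[x];

   if(letra.match(/[a-z]/)){
      var titIni = x-(string[x-1] == " " ? 2 : 1);
      break;
   }
}

// The title ends at the next period followed by a space.
var titulo = string.substring(titIni,string.indexOf(". ", titIni));
console.log(titulo);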
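
One edge case is worth guarding against: if no ". " follows the title, indexOf returns -1 and substring then returns the text before the title instead. A possible sketch of the same scan with that guard is below. The function name `titleByScan` and the choice to return `null` when no title can be delimited are assumptions of mine, not part of the code above.

```javascript
// Hypothetical helper: the same scan, with guards for the failure cases.
function titleByScan(citation) {
  // Position of the first lowercase ASCII letter.
  const firstLower = citation.search(/[a-z]/);
  if (firstLower === -1) return null; // no lowercase letter at all

  // Same step-back rule: 2 if the letter starts a word, otherwise 1.
  const step = citation[firstLower - 1] === " " ? 2 : 1;
  const start = Math.max(0, firstLower - step);

  // Without this check, a -1 from indexOf would make substring()
  // return the text before the title.
  const end = citation.indexOf(". ", start);
  if (end === -1) return null;

  return citation.substring(start, end);
}

console.log(titleByScan("ECO, U. O conceito de texto. São Paulo: T. A. Q. /EDUSP, 1984."));
// prints: O conceito de texto

console.log(titleByScan("BRITISH LIBRARY"));
// prints: null
```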
    
03.04.2018 / 19:42
1

My reasoning was that each citation has three parts:

  • Authors
  • Title
  • Descriptions

So you can define the following rules:

  

Authors: surname + comma + space + name + period, e.g. BERNERS-LEE, T. Multiple authors are separated by ; and the last author always ends with a period.

Title: any text that contains no ; and ends with a period.

Description: anything that comes after the title.

REGEX

^((?:.+?, .+?;)*?(?:[^;\s]+?, .+?)\.)([^;]+?\.).*$

Explanation

  • ((?:.+?, .+?;)*?(?:[^;\s]+?, .+?)\.) - group 1, matches the authors
    • (?:.+?, .+?;)*? - handles multiple authors, each of which ends with ;
    • (?:[^;\s]+?, .+?)\. - matches the last author, who is never followed by ; and ends in .
  • ([^;]+?\.) - group 2, matches the title, ending with .
  • .*$ - the description, running to the end of the line.
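
As a rough sketch of how the pattern might be used from JavaScript, the snippet below applies it with `String.prototype.match` and reads capture group 2. The pattern is copied verbatim from above. The helper name `titleByRegex`, the trimming of the captured text and the `null` return are assumptions added for illustration.

```javascript
// Pattern copied verbatim from the answer.
const CITATION_RE = /^((?:.+?, .+?;)*?(?:[^;\s]+?, .+?)\.)([^;]+?\.).*$/;

// Hypothetical helper: returns the title, or null when the pattern does not match.
function titleByRegex(citation) {
  const m = citation.match(CITATION_RE);
  if (m === null) return null;
  // Group 2 includes the leading space and the closing period, so strip both.
  return m[2].trim().replace(/\.$/, "");
}

console.log(titleByRegex("ECO, U. O conceito de texto. São Paulo: T. A. Q. /EDUSP, 1984."));
// prints: O conceito de texto

console.log(titleByRegex("BERNERS-LEE, T.; CAILLIAU, R. WorldWideWeb: Proposal for a HyperText Project. 1990. Disponível em: < http://www.w3.org/Proposal.html >. Acesso em: 13 out. 2014."));
// prints: WorldWideWeb: Proposal for a HyperText Project
```

Note that `[^;\s]+?` does not accept whitespace in the last author's surname, so a citation such as the DI MAIO one, which also contains no ; at all, is not matched and the helper returns null for it.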

See Regex101

    
03.04.2018 / 19:48